Running title: Efficient Structure Learning of Bayesian Nets

Abstract 

This paper addresses the problem of learning Bayesian network structures from data based
on score functions that are decomposable. It describes properties that strongly reduce the
time and memory costs of many known methods without losing global optimality guarantees.
These properties are derived for different score criteria such as Minimum Description
Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian
Dirichlet Criterion. Then a branch-and-bound algorithm is presented that integrates
structural constraints with data in a way that guarantees global optimality. As an example,
structural constraints are used to map the problem of structure learning in Dynamic Bayesian
networks into a corresponding augmented Bayesian network. Finally, we show empirically
that, with the use of these properties, the new algorithm as well as state-of-the-art methods
can handle larger data sets than is currently possible without them.

Keywords: Bayesian networks, structure learning, properties of decomposable scores,
structural constraints, branch-and-bound technique


1.  Introduction 

A Bayesian network is a probabilistic graphical model that relies on a structured
dependency among random variables to represent a joint probability distribution in a compact
and efficient manner. It is composed of a directed acyclic graph (DAG) where nodes are
associated with random variables and conditional probability distributions are defined for
variables given their parents in the graph. Learning the graph (or structure) of these networks
from data is one of the most challenging problems, even if data are complete. The problem
is known to be NP-hard (Chickering et al., 2004), and the best known exact methods take
exponential time in the number of variables and are applicable only to small settings (around
30 variables). Approximate procedures can handle larger networks, but usually they get
stuck in local maxima. Nevertheless, the quality of the structure plays a crucial role in
the accuracy of the model. If the dependency among variables is not properly learned, the
estimated distribution may be far from the correct one.

©2010 Cassio P. de Campos and Qiang Ji.

In general terms, the problem is to find the best structure (DAG) according to some score
function that depends on the data (Heckerman et al., 1995). There are methods based on
other (local) statistical analysis (Spirtes et al., 1993), but they follow a completely different
approach. The research on this topic is active (Chickering, 2002; Teyssier and Koller, 2005;
Tsamardinos et al., 2006; Silander and Myllymäki, 2006; Parviainen and Koivisto, 2009;
de Campos et al., 2009; Jaakkola et al., 2010), mostly focused on complete data. In this
case, the best exact methods (which are guaranteed to find the globally best-scoring structure)
are based on dynamic programming (Koivisto and Sood, 2004; Singh and Moore, 2005;
Koivisto, 2006; Silander and Myllymäki, 2006; Parviainen and Koivisto, 2009), and they
spend time and memory proportional to $n \cdot 2^n$, where $n$ is the number of variables. Such
complexity restricts the use of those methods to a couple of tens of variables, mainly because
of the memory consumption (even though time complexity is also a clear issue). Ott and
Miyano (2003) devise a faster algorithm when the complexity of the structure is limited
(for instance the maximum number of parents per node and the degree of connectivity of an
underlying graph). Perrier et al. (2008) use structural constraints (creating a super-structure
of which the optimal structure must be a subgraph) to reduce the search space, showing that
this direction is promising when one wants to learn structures from large data sets. Kojima
et al. (2010) extend the same ideas with other types of constraints. Most of these methods are
based on improving the dynamic programming method to work over reduced search spaces.
On a different front, Jaakkola et al. (2010) apply a linear programming relaxation to solve
the problem, together with a branch-and-bound search. Branch-and-bound methods can
be effective when good bounds and cuts are available. For example, this approach has had
some success on the Traveling Salesman Problem (Applegate et al., 2006). We have
proposed an algorithm that also uses branch and bound, but employs a different technique
to find bounds (de Campos et al., 2009). It has been shown that branch-and-bound methods
can handle somewhat larger networks than the dynamic programming ideas. The method
is described in detail in Section 5.

In the first part of this paper, we present structural constraints as a way to reduce
the search space. We explore the use of constraints to devise methods to learn specialized
versions of Bayesian networks (such as naive Bayes and tree-augmented naive Bayes)
and generalized versions, such as Dynamic Bayesian networks (DBNs). DBNs are used to
model temporal processes. We describe a procedure to map the structural learning problem
of a DBN into a corresponding augmented Bayesian network through the use of further
constraints, so that the same exact algorithm we discuss for Bayesian networks can be
employed for DBNs.

In  the  second  part,  we  present  some  properties  of  the  problem  that  bring  a  considerable 
improvement  on  many  known  methods.  We  build  on  our  recent  work  (de  Campos  et  al., 
2009)  on  Akaike  Information  Criterion  (AIC)  and  Bayesian  Information  Criterion  (BIC), 
and  present  new  results  for  the  Bayesian  Dirichlet  (BD)  criterion  (Cooper  and  Herskovits, 
1992)  and  some  derivations  under  a  few  assumptions.  We  show  that  the  search  space  of 
possible  structures  can  be  reduced  drastically  without  losing  the  global  optimality  guarantee 
and  that  the  memory  requirements  are  very  small  in  many  practical  cases. 

As data sets with many variables cannot be efficiently handled (unless P=NP), a desired
property of a learning method is to produce an any-time solution, that is, the procedure, if
stopped at any moment, provides an approximate solution, while if run until it finishes, a
global optimum solution is found. However, the most efficient exact methods are not any-time.
We describe an any-time exact algorithm using a branch-and-bound (B&B) approach
with caches. Scores are pre-computed during an initialization step to save computational
time. Then we perform the search over the possible graphs iterating over arcs. Because of
the B&B properties, the algorithm can be stopped with a best current solution and an upper
bound to the global optimum, which gives a certificate to the answer and allows the user to
stop the computation when she/he believes that the current solution is good enough. For
example, such an algorithm can be integrated with a structural Expectation-Maximization
(EM) method without the huge computational expenses of other exact methods by using
the generalized EM (where finding an improving solution is enough), while still guaranteeing
that a global optimum is found if run until the end. Due to this property, the only source
of approximation would be the EM method itself. It is worth noting that using a B&B
method is not new for structure learning (Suzuki, 1996). Still, that previous idea does not
constitute a global exact algorithm; instead, the search is conducted after a node ordering
is fixed. Our method does not rely on a predefined ordering and finds a global optimum
structure considering all possible orderings.

The paper is organized as follows. Section 2 describes the notation and introduces Bayesian
networks and the structure learning problem based on score functions. Section 3 presents
the structural constraints that are treated in this work, and shows examples of how they
can be used to learn different types of networks. Section 4 presents important properties of
the score functions that considerably reduce the memory and time costs of many methods.
Section 5 details our branch-and-bound algorithm, while Section 6 shows experimental
evaluations of the properties, the constraints and the exact method. Finally, Section 7
concludes the paper.

2.  Bayesian  networks 

A Bayesian network represents a joint probability distribution over a collection of random
variables, which we assume to be categorical. It can be defined as a triple
$(\mathcal{G}, \mathbf{X}, \mathcal{P})$, where $\mathcal{G} \doteq (V_{\mathcal{G}}, E_{\mathcal{G}})$
is a directed acyclic graph (DAG) with $V_{\mathcal{G}}$ a collection of $n$ nodes associated
with random variables $\mathbf{X}$ (a node per variable), and $E_{\mathcal{G}}$ a collection of arcs;
$\mathcal{P}$ is a collection of conditional mass functions $p(X_i \mid \Pi_i)$ (one for each
instantiation of $\Pi_i$), where $\Pi_i$ denotes the parents of $X_i$ in the graph ($\Pi_i$ may be
empty), respecting the relations of $E_{\mathcal{G}}$. In a Bayesian network every variable is
conditionally independent of its non-descendants given its parents (Markov condition).

We use uppercase letters such as $X_i, X_j$ to represent variables (or nodes of the graph,
which are used interchangeably), and $x_i$ to represent a generic state of $X_i$, which has state
space $\Omega_{X_i} \doteq \{x_{i1}, x_{i2}, \ldots, x_{i r_i}\}$, where $r_i \doteq |\Omega_{X_i}| \geq 2$
is the number of (finite) categories of $X_i$ ($|\cdot|$ is the cardinality of a set or vector, and the
notation $\doteq$ is used to indicate a definition instead of a mathematical equality). Bold letters
are used to emphasize sets or vectors. For example,
$\mathbf{x} \in \Omega_{\mathbf{X}} \doteq \times_{X \in \mathbf{X}} \Omega_X$, for
$\mathbf{X} \subseteq \mathbf{X}$, is an instantiation for all the variables in $\mathbf{X}$.
Furthermore, $r_{\Pi_i} \doteq |\Omega_{\Pi_i}| = \prod_{X_t \in \Pi_i} r_t$ is the number of possible
instantiations of the parent set $\Pi_i$ of $X_i$, and $\theta = (\theta_{ijk})_{\forall ijk}$ is the entire
vector of parameters such that the elements are $\theta_{ijk} = p(x_{ik} \mid \pi_{ij})$, with
$i \in \{1, \ldots, n\}$, $j \in \{1, \ldots, r_{\Pi_i}\}$, $k \in \{1, \ldots, r_i\}$, and
$\pi_{ij} \in \Omega_{\Pi_i}$.

Because of the Markov condition, the Bayesian network represents a joint probability
distribution by the expression $p(\mathbf{x}) = p(x_1, \ldots, x_n) = \prod_i p(x_i \mid \pi_i)$, for every
$\mathbf{x} \in \Omega_{\mathbf{X}}$, where every $x_i$ and $\pi_i$ are consistent with $\mathbf{x}$.
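
As a small illustration of this factorization, the following sketch evaluates the joint probability of a full instantiation as the product of local conditional probabilities. The two-node network, the dictionary layout and the probability values are hypothetical and serve only to make the expression concrete.

```python
# Hypothetical two-node network X1 -> X2 over binary variables (illustration only).
parents = {"X1": (), "X2": ("X1",)}

# cpt[var][(parent_states, state)] = p(state | parent_states); the values are made up.
cpt = {
    "X1": {((), 0): 0.7, ((), 1): 0.3},
    "X2": {((0,), 0): 0.9, ((0,), 1): 0.1,
           ((1,), 0): 0.2, ((1,), 1): 0.8},
}


def joint_probability(x):
    """Return p(x) as the product over nodes of p(x_i | pi_i), both read from x."""
    prob = 1.0
    for var, pa in parents.items():
        pa_states = tuple(x[p] for p in pa)  # parent states consistent with x
        prob *= cpt[var][(pa_states, x[var])]
    return prob


# Summing over all instantiations should give one for a valid network.
total = sum(joint_probability({"X1": a, "X2": b}) for a in (0, 1) for b in (0, 1))
```

Only the local tables are stored, so the number of parameters grows with the sizes of the parent sets rather than with the size of the full joint state space.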

Given a complete data set $D = \{D_1, \ldots, D_N\}$ with $N$ instances, where
$D_u \doteq \mathbf{x}_u \in \Omega_{\mathbf{X}}$ is an instantiation of all the variables, the goal of
structure learning is to find a DAG $\mathcal{G}$ that maximizes a given score function, that is, we
look for $\mathcal{G}^* = \operatorname{argmax}_{\mathcal{G} \in \mathbb{G}} s_D(\mathcal{G})$, with
$\mathbb{G}$ the set of all DAGs with nodes $\mathbf{X}$, for a given score function $s_D$ (the
dependency on data is indicated by the subscript $D$).$^1$ In this paper, we consider some
well-known score functions: the Bayesian Information Criterion (BIC) (Schwarz, 1978) (which is
equivalent to the Minimum Description Length), the Akaike Information Criterion (AIC) (Akaike, 1974),
and the Bayesian Dirichlet (BD) (Cooper and Herskovits, 1992), which has as subcases BDe
and BDeu (Buntine, 1991; Cooper and Herskovits, 1992; Heckerman et al., 1995). As done
before in the literature, we assume parameter independence and modularity (Heckerman
et al., 1995). The score functions based on BIC and AIC differ only in the weight that is
given to the penalty term:

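$$\text{BIC/AIC}: \quad s_D(\mathcal{G}) = \max_{\theta} L_{\mathcal{G},D}(\theta) - t(\mathcal{G}) \cdot w,$$

where $t(\mathcal{G}) = \sum_{i=1}^{n} (r_{\Pi_i} \cdot (r_i - 1))$ is the number of free parameters,
$w = \frac{\log N}{2}$ for BIC and $w = 1$ for AIC, $L_{\mathcal{G},D}$ is the log-likelihood function
with respect to data $D$ and graph $\mathcal{G}$:

$$L_{\mathcal{G},D}(\theta) = \log \prod_{i=1}^{n} \prod_{j=1}^{r_{\Pi_i}} \prod_{k=1}^{r_i} \theta_{ijk}^{n_{ijk}}, \tag{1}$$

where $n_{ijk}$ indicates how many elements of $D$ contain both $x_{ik}$ and $\pi_{ij}$. Note that the
values $(n_{ijk})_{\forall ijk}$ depend on the graph $\mathcal{G}$ (more specifically, they depend on the
parent set $\Pi_i$ of each $X_i$), so a more precise notation would be to use $n_{ijk}^{\Pi_i}$ instead of
$n_{ijk}$. We avoid this heavy notation for simplicity unless necessary in the context. Moreover, we
know that $\theta^* = (\theta^*_{ijk})_{\forall ijk} = \left(\frac{n_{ijk}}{n_{ij}}\right)_{\forall ijk} = \operatorname{argmax}_{\theta} L_{\mathcal{G},D}(\theta)$,
with $n_{ij} = \sum_k n_{ijk}$.$^2$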

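In the case of the BD criterion, the idea is to compute a score based on the posterior
probability of the structure $p(\mathcal{G} \mid D)$. For that purpose, the following score function is used:

$$\text{BD}: \quad s_D(\mathcal{G}) = \log \left( p(\mathcal{G}) \cdot \int p(D \mid \mathcal{G}, \theta) \cdot p(\theta \mid \mathcal{G}) \, d\theta \right),$$

where the logarithm is often used to simplify computations, $p(\theta \mid \mathcal{G})$ is the prior of
$\theta$ for a given graph $\mathcal{G}$, assumed to be a Dirichlet with hyper-parameters
$\alpha = (\alpha_{ijk})_{\forall ijk}$ (which are assumed to be strictly positive):

$$p(\theta \mid \mathcal{G}) = \prod_{i=1}^{n} \prod_{j=1}^{r_{\Pi_i}} \Gamma(\alpha_{ij}) \prod_{k=1}^{r_i} \frac{\theta_{ijk}^{\alpha_{ijk} - 1}}{\Gamma(\alpha_{ijk})},$$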

1. In case of many optimal DAGs, we assume no preference among them and argmax returns one of them.

2. If $n_{ij} = 0$, then $n_{ijk} = 0$ and we assume the fraction to be equal to one.

where $\alpha_{ij} = \sum_k \alpha_{ijk}$. Hyper-parameters $(\alpha_{ijk})_{\forall ijk}$ also depend on
the graph $\mathcal{G}$, and we indicate it by $\alpha_{ijk}^{\Pi_i}$ if necessary in the context. From now
on, we also omit the subscript $D$. We assume that there is no preference for any graph, so
$p(\mathcal{G})$ is uniform and vanishes in the computations. Under these assumptions, it has been shown
(Cooper and Herskovits, 1992) that for multinomial distributions,

$$s(\mathcal{G}) = \log \prod_{i=1}^{n} \prod_{j=1}^{r_{\Pi_i}} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})}. \tag{2}$$

The BDe score (Heckerman et al., 1995) assumes that $\alpha_{ijk} = \alpha^* \cdot p(\theta_{ijk} \mid \mathcal{G})$,
where $\alpha^*$ is the hyper-parameter known as the Equivalent Sample Size (ESS), and
$p(\theta_{ijk} \mid \mathcal{G})$ is the prior probability for $(x_{ik} \wedge \pi_{ij})$ given $\mathcal{G}$ (or
simply given $\Pi_i$). The BDeu score (Buntine, 1991; Cooper and Herskovits, 1992) assumes further
that local priors are such that $\alpha_{ijk}$ becomes $\frac{\alpha^*}{r_{\Pi_i} r_i}$ and $\alpha^*$ is the
only free hyper-parameter.

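An important property of all such criteria is that their functions are decomposable and
can be written in terms of the local nodes of the graph, that is,
$s(\mathcal{G}) = \sum_{i=1}^{n} s_i(\Pi_i)$, such that

$$\text{BIC/AIC}: \quad s_i(\Pi_i) = \max_{\theta_i} L_{\Pi_i}(\theta_i) - t_i(\Pi_i) \cdot w, \tag{3}$$

where $L_{\Pi_i}(\theta_i) = \sum_{j=1}^{r_{\Pi_i}} \sum_{k=1}^{r_i} n_{ijk} \log \theta_{ijk}$, and
$t_i(\Pi_i) = r_{\Pi_i} \cdot (r_i - 1)$. And similarly,

$$\text{BD}: \quad s_i(\Pi_i) = \sum_{j=1}^{r_{\Pi_i}} \left( \log \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} + \sum_{k=1}^{r_i} \log \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} \right). \tag{4}$$

In the case of BIC and AIC, Equation (3) is used to compute the global score of a graph
using the local scores at each node, while Equation (4) is employed for BD, BDe and BDeu,
using the respective hyper-parameters $\alpha$.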
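
To make Equations (3) and (4) concrete, the sketch below computes local scores from a complete data set given as a list of dictionaries. It is only an illustration under stated assumptions: the function names, the data layout and the `arity` map (number of categories per variable) are hypothetical, BIC uses the usual weight $w = \log N / 2$, and the BD variant shown is BDeu with $\alpha_{ijk} = \alpha^* / (r_{\Pi_i} r_i)$.

```python
import math
from collections import Counter


def local_counts(data, child, parents):
    """Return {parent configuration j: Counter(child state k -> n_ijk)}."""
    counts = {}
    for row in data:
        j = tuple(row[p] for p in parents)
        counts.setdefault(j, Counter())[row[child]] += 1
    return counts


def local_bic_aic(data, child, parents, arity, criterion="BIC"):
    """Local score of Equation (3), with theta set to its maximum-likelihood value."""
    n_total = len(data)
    w = math.log(n_total) / 2 if criterion == "BIC" else 1.0
    loglik = 0.0
    for ctr in local_counts(data, child, parents).values():
        n_ij = sum(ctr.values())
        for n_ijk in ctr.values():  # only observed states, so n_ijk > 0
            loglik += n_ijk * math.log(n_ijk / n_ij)
    r_pi = math.prod(arity[p] for p in parents)  # equals 1 for an empty parent set
    t_i = r_pi * (arity[child] - 1)
    return loglik - t_i * w


def local_bdeu(data, child, parents, arity, ess=1.0):
    """Local score of Equation (4) under the BDeu choice of hyper-parameters."""
    r_i = arity[child]
    r_pi = math.prod(arity[p] for p in parents)
    a_ij, a_ijk = ess / r_pi, ess / (r_pi * r_i)
    score = 0.0
    # Unobserved configurations and states contribute zero, so they are skipped.
    for ctr in local_counts(data, child, parents).values():
        n_ij = sum(ctr.values())
        score += math.lgamma(a_ij) - math.lgamma(a_ij + n_ij)
        for n_ijk in ctr.values():
            score += math.lgamma(a_ijk + n_ijk) - math.lgamma(a_ijk)
    return score
```

Working with `math.lgamma` rather than the Gamma function itself avoids overflow for large counts. By decomposability, the global score of a graph is the sum of these local scores over all nodes.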


3.  Structural  Constraints 

A  way  to  reduce  the  space  of  possible  DAGs  is  to  consider  some  constraints  provided  by 
experts.  We  work  with  structural  constraints  that  specify  where  arcs  may  or  may  not  be 
included.  These  constraints  help  to  reduce  the  search  space  and  are  available  in  many 
situations.  Moreover,  we  show  examples  in  Sections  3.1  and  3.2  of  how  these  constraints 
can  be  used  to  learn  structures  of  different  types  of  networks,  such  as  naive  Bayes,  tree- 
augmented  naive  Bayes,  and  Dynamic  Bayesian  networks.  We  work  with  the  following  rules, 
used  to  build  up  the  structural  constraints: 

• $\mathit{indegree}(X_j, k, op)$, where $op \in \{lt, eq\}$ and $k$ is an integer, means that the node
$X_j$ must have fewer than (when $op = lt$) or exactly (when $op = eq$) $k$ parents.

• $\mathit{arc}(X_i, X_j)$ indicates that the node $X_i$ must be a parent of $X_j$.

• Operators or ($\lor$) and not ($\lnot$) are used to form the rules. The and operator is not
explicitly used as we assume that each constraint is in disjunctive normal form.

The structural constraints can be imposed locally as long as they involve just a single
node and its parents. In essence, parent sets of a node that violate a constraint are
never processed nor stored, and this can be checked locally when one is about to compute
the local score. On the other hand, constraints such as
$(\mathit{arc}(X_1, X_2) \lor \mathit{arc}(X_2, X_3))$ cannot be imposed locally, as they define a
non-local condition (the arcs go to distinct variables, namely $X_2$ and $X_3$). In this work we
assume that constraints are local. Besides constraints devised by an expert, one might use
constraints to force the learning procedure to obtain specialized types of networks. The next
two subsections describe (somewhat non-trivial) examples of the use of constraints to learn
different types of networks. Specialized networks tend to be easier to learn, because the search
space is already reduced to the structures that satisfy the underlying constraints. Readers who
are only interested in learning general Bayesian networks might want to skip the rest of this
section and continue from Section 4.
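
The local check described above can be sketched as follows. This is a hypothetical encoding, not the representation used in the paper: each constraint attached to a node is taken to be a disjunction of possibly negated rules, all constraints of the node must hold, and the tuple formats `("arc", source, positive)` and `("indegree", k, op, positive)` are assumptions made for the example.

```python
def rule_holds(rule, parent_set):
    """Evaluate one (possibly negated) rule for the node owning parent_set."""
    kind = rule[0]
    if kind == "arc":
        _, source, positive = rule  # arc(source, this node)
        holds = source in parent_set
    elif kind == "indegree":
        _, k, op, positive = rule  # indegree(this node, k, op)
        holds = len(parent_set) < k if op == "lt" else len(parent_set) == k
    else:
        raise ValueError(f"unknown rule type: {kind}")
    return holds if positive else not holds


def admissible(parent_set, constraints):
    """True if every constraint (a disjunction of rules) is satisfied."""
    return all(any(rule_holds(rule, parent_set) for rule in clause)
               for clause in constraints)


# Hypothetical constraints for one node: fewer than 3 parents,
# X1 must be a parent, and X4 must not be a parent.
constraints_x2 = [
    [("indegree", 3, "lt", True)],
    [("arc", "X1", True)],
    [("arc", "X4", False)],
]
```

A candidate parent set that fails `admissible` would be skipped before its local score is computed, so it never enters the cache.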

3.1  Learning  Naive  and  TAN  structures 

For example, the constraints $\forall_{i \neq c,\, j \neq c}\ \lnot \mathit{arc}(X_i, X_j)$ and
$\mathit{indegree}(X_c, 0, eq)$ impose that only arcs from node $X_c$ to the others are possible, and
that $X_c$ is a root node, that is, a Naive Bayes structure will be learned. A learning procedure
would in fact act as a feature selection procedure by leaving some variables unlinked. Note that
the symbol $\forall$ just employed is not part of the language but is used for ease of exposition (in
fact it is necessary to write down every constraint defined by such a construction). As another
example, the constraints $\forall_{j \neq c}\ \mathit{indegree}(X_j, 3, lt)$,
$\mathit{indegree}(X_c, 0, eq)$, and
$\forall_{j \neq c}\ \mathit{indegree}(X_j, 0, eq) \lor \mathit{arc}(X_c, X_j)$ ensure that all nodes
have $X_c$ as a parent, or no parent at all. Besides $X_c$, each node may have at most one other
parent, and $X_c$ is a root node. This learns the structure of a Tree-augmented Naive (TAN)
classifier, also performing a kind of feature selection (some variables may end up unlinked). In
fact, it learns a forest of trees, as we have not imposed that all variables must be linked. In
Section 6 we present some experimental results which indicate that learning TANs is a much easier
(yet still very important) practical situation.
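
To illustrate how strongly the TAN constraints shrink the local search space, the following sketch enumerates the parent sets they leave admissible for each node. The function name and the node labels are hypothetical, and the constraints are hard-coded rather than parsed from the rule language.

```python
from itertools import combinations


def tan_parent_sets(nodes, cls):
    """Enumerate parent sets allowed by the TAN constraints with class node cls."""
    admissible = {cls: [frozenset()]}  # indegree(X_c, 0, eq): the class is a root
    for xj in nodes:
        if xj == cls:
            continue
        candidates = [x for x in nodes if x != xj]
        sets = []
        for size in range(3):  # indegree(X_j, 3, lt): fewer than 3 parents
            for cand in combinations(candidates, size):
                ps = frozenset(cand)
                # indegree(X_j, 0, eq) or arc(X_c, X_j)
                if not ps or cls in ps:
                    sets.append(ps)
        admissible[xj] = sets
    return admissible


example = tan_parent_sets(["C", "A", "B", "D"], cls="C")
```

With $n$ nodes, each non-class node keeps only $n$ candidate parent sets (the empty set, the class alone, and the class plus one other node) instead of the $2^{n-1}$ sets of the unconstrained problem.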

We point out that learning structures of networks with the particular purpose of building
a classifier can also be tackled by other score functions that consider conditional
distributions (Pernkopf and Bilmes, 2005). Here we present a way to learn TANs considering the
fit of the joint distribution, which can be done by constraints. Further discussion about
learning classifiers is beyond the scope of this work.

3.2  Learning  Dynamic  Bayesian  Networks 

A more sophisticated application of structural constraints is presented in this section,
where they are employed to translate structure learning in Dynamic Bayesian Networks
(DBNs) to a corresponding problem in Bayesian networks. While Bayesian networks
are not directly related to time, DBNs are used to model temporal processes. Assuming
Markovian and stationary properties, DBNs may be encoded in a very compact way and
inferences are executed quickly. They are built over a collection of sets of random variables
$\{\mathbf{X}^0, \mathbf{X}^1, \ldots, \mathbf{X}^T\}$ representing variables at different times
$0, 1, \ldots, T$ (we assume that time is discrete). A Markovian property holds, which ensures that
$p(\mathbf{X}^{t+1} \mid \mathbf{X}^0, \ldots, \mathbf{X}^t) = p(\mathbf{X}^{t+1} \mid \mathbf{X}^t)$, for
$0 \leq t < T$. Furthermore, because the process is assumed to be stationary, we have that
$p(\mathbf{X}^{t+1} \mid \mathbf{X}^t)$ is independent of $t$, that is,
$p(\mathbf{X}^{t+1} \mid \mathbf{X}^t) = p(\mathbf{X}^{t'+1} \mid \mathbf{X}^{t'})$ for any
$0 \leq t, t' < T$. This means that a DBN is just a collection of Bayesian networks that
share the same structure and parameters (apart from the initial Bayesian network for time
zero). If $X_i^t \in \mathbf{X}^t$ are the variables at time $t$, a DBN may have arcs between nodes
$X_i^t$ of the same time $t$ and arcs from nodes $X_i^{t-1}$ (previous time) to nodes $X_i^t$ of time
$t$. Hence, a DBN can be viewed as a two-slice temporal Bayesian network, where at time zero, we
have a standard Bayesian network as in Section 2, which we denote $\mathcal{B}^0$, and for slices 1
to $T$ we have another Bayesian network (called transitional Bayesian network and denoted simply
$\mathcal{B}$) defined over the same variables but where nodes may have parents on two consecutive
slices, that is, $\mathcal{B}$ precisely defines the distributions
$p(\mathbf{X}^{t+1} \mid \mathbf{X}^t)$, for any $0 \leq t < T$.

To learn a DBN, we assume that many temporal sequences of data are available. Thus,
a complete data set $D = \{D_1, \ldots, D_N\}$ is composed of $N$ sequences, where each $D_u$ is
composed of instances $D_u^t \doteq \mathbf{x}_u^t = \{x_{u,1}^t, \ldots, x_{u,n}^t\}$, for
$t = 0, \ldots, T$ (where $T$ is the total number of slices/frames apart from the initial one). Note
that there is an implicit order among the elements of each $D_u$. We denote by
$D^0 = \{D_u^0 : 1 \leq u \leq N\}$ the data of the first slice, and by
$D^t = \{(D_u^t, D_u^{t-1}) : 1 \leq u \leq N\}$, with $1 \leq t \leq T$, the data of a slice $t$ (note
that the data of the slice $t - 1$ is also included, because it is necessary for learning the
transitions). As the conditional probability distributions for time $t > 0$ share the same
parameters, we can unroll the DBN to obtain the factorization
$p(\mathbf{X}^{0:T}) = \prod_i p^0(X_i^0 \mid \Pi_i^0) \prod_{t=1}^{T} \prod_i p(X_i^t \mid \Pi_i^t)$,
where $p^0(X_i^0 \mid \Pi_i^0)$ are the local conditional distributions of $\mathcal{B}^0$, $X_i^t$ and
$\Pi_i^t$ represent the corresponding variables in time $t$, and $p(X_i^t \mid \Pi_i^t)$ are the local
distributions of $\mathcal{B}$.

Unfortunately, learning a DBN is at least as hard as learning a Bayesian network, because
the former can be viewed as a generalization of the latter. Still, we show that the same
method used for Bayesian networks can be used to learn DBNs. With complete data,
learning parameters of DBNs is similar to learning parameters of Bayesian networks, but
we deal with counts for both $\mathcal{B}^0$ and $\mathcal{B}$. The counts related to $\mathcal{B}^0$ are
obtained from the first slice of each sequence, so there are $N$ samples overall, while counts for
$\mathcal{B}$ are obtained from the whole time sequences, so there are $N \cdot T$ elements to consider
(supposing that each sequence has the same length $T$, for ease of exposition). The score function
of a given structure decomposes between the score function of $\mathcal{B}^0$ and the score function
of $\mathcal{B}$ (because of the decomposability of score functions), so we look for graphs such that

$$(\mathcal{G}^{0*}, \mathcal{G}'^{*}) = \operatorname*{argmax}_{\mathcal{G}^0, \mathcal{G}'} \left( s_{D^0}(\mathcal{G}^0) + s_{D^{1:T}}(\mathcal{G}') \right) = \left( \operatorname*{argmax}_{\mathcal{G}^0} s_{D^0}(\mathcal{G}^0),\ \operatorname*{argmax}_{\mathcal{G}'} s_{D^{1:T}}(\mathcal{G}') \right), \tag{5}$$

where $\mathcal{G}^0$ is a graph over $\mathbf{X}^0$ and $\mathcal{G}'$ is a graph over variables
$\mathbf{X}^t \cup \mathbf{X}^{t-1}$ of a generic slice $t$ and its predecessor $t - 1$. Counts are
obtained from data sets with time sequences separately for the initial and the transitional
Bayesian networks, and the problem reduces to the learning problem in a Bayesian network with
some constraints that force the arcs to respect the DBN's stationarity and Markovian
characteristics (of course, it is necessary to obtain the counts from the data in a particular
way). We make use of the constraints defined in Section 3 to develop a simple transformation of
the structure learning problem to a corresponding structure learning problem in an augmented
Bayesian network. The steps of this procedure are as follows:

1. Learn $\mathcal{B}^0$ using the data set $D^0$. Note that this is already a standard Bayesian network
structure learning problem, so we obtain the graph $\mathcal{G}^0$ for the first maximization of
Equation (5).

2. Suppose there is a Bayesian network $\mathcal{B}' = (\mathcal{G}', \mathbf{X}', \mathcal{P}')$ with twice
as many nodes as $\mathcal{B}^0$. Denote the nodes as $(X_1, \ldots, X_n, X'_1, \ldots, X'_n)$. Construct a
new data set $D'$ that is composed of $N \cdot T$ elements $\{D^1, \ldots, D^T\}$. Note that $D'$ is
precisely a data set over $2n$ variables, because it is formed of pairs $(D_u^{t-1}, D_u^t)$, which are
complete instantiations for the variables of $\mathcal{B}'$, containing the elements of two
consecutive slices.

3. Include structural constraints as follows:

$$\forall_{1 \leq i \leq n}\ \mathit{arc}(X_i, X'_i) \tag{6}$$

$$\forall_{1 \leq i \leq n}\ \mathit{indegree}(X_i, 0, eq) \tag{7}$$

Equation (6) forces the time relation between the same variable in consecutive time
slices (in fact this constraint might be discarded if one does not want to force
each variable to be correlated with itself in the previous slice). Equation (7) forces the
variables $X_1, \ldots, X_n$ to have no parents (these are the variables that are simulating
the previous slice, while the variables $\mathbf{X}'$ are simulating the current slice).

4. Learn $\mathcal{B}'$ using the data set $D'$ with a standard Bayesian network structure learning
procedure capable of enforcing the structural constraints. Note that the parent sets
of $X_1, \ldots, X_n$ are already fixed to be empty, so the output graph will maximize the
scores associated only with nodes $\mathbf{X}'$:

$$\operatorname*{argmax}_{\mathcal{G}'} s_{D^{1:T}}(\mathcal{G}') = \operatorname*{argmax}_{\mathcal{G}'} \left( \sum_{i} s_{i, D^{1:T}}(\emptyset) + \sum_{i} s_{i', D^{1:T}}(\Pi'_i) \right) = \operatorname*{argmax}_{\mathcal{G}'} \sum_{i} s_{i', D^{1:T}}(\Pi'_i).$$

This holds because of the decomposability of the score function among nodes, so that
the scores of the nodes $X_1, \ldots, X_n$ are fixed and can be disregarded in the maximization
(they are constant).

5. Finally, we take the subgraph of $\mathcal{G}'$ corresponding to the variables $X'_1, \ldots, X'_n$ to be
the graph of the transitional Bayesian network $\mathcal{B}$. This subgraph has arcs among
$X'_1, \ldots, X'_n$ (which are arcs correlating variables of the same time slice) as well as arcs
from the previous slice to the nodes $X'_1, \ldots, X'_n$.

Therefore, after applying this transformation, the structure learning problem in a DBN
can be performed by two calls to the method that solves the problem in a Bayesian network.
We point out that an expert may create her/his own constraints to be used during
the learning, besides those constraints introduced by the transformation, as long as such
constraints do not violate the DBN implicit constraints. This makes it possible to learn DBNs
together with expert knowledge in the form of structural constraints.
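
The data side of this transformation (steps 1 to 3) can be sketched as below. The sequence layout (each sequence is a list of dictionaries, one per slice), the function names and the convention of marking current-slice variables with a trailing apostrophe are assumptions made for the illustration; the constraints are emitted as plain strings in the notation of Section 3.

```python
def split_dbn_data(sequences):
    """Return (D0, D'): first-slice data and paired consecutive-slice data."""
    d0 = [dict(seq[0]) for seq in sequences]
    d_prime = []
    for seq in sequences:
        for t in range(1, len(seq)):
            row = dict(seq[t - 1])  # X_1..X_n take the values of slice t-1
            # X'_1..X'_n take the values of slice t
            row.update({name + "'": value for name, value in seq[t].items()})
            d_prime.append(row)
    return d0, d_prime


def transition_constraints(variables, force_self_arcs=True):
    """Constraints (7) and, optionally, (6) for the augmented network."""
    rules = [f"indegree({x}, 0, eq)" for x in variables]
    if force_self_arcs:
        rules += [f"arc({x}, {x}')" for x in variables]
    return rules
```

With $N$ sequences of $T + 1$ slices each, `split_dbn_data` yields $N$ rows for learning the initial network and $N \cdot T$ rows over $2n$ variables for the augmented network, which can then be passed to any exact learner that enforces the listed constraints.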


4. Properties of the score functions

In this section we present mathematical properties that are useful when computing score
functions. Local scores need to be computed many times to evaluate the candidate graphs
when we look for the best graph. Because of decomposability, we can avoid computing
such functions several times by creating a cache that contains $s_i(\Pi_i)$ for each $X_i$ and each
parent set $\Pi_i$. Note that this cache may have a size exponential in $n$, as there are $2^{n-1}$
subsets of $\{X_1, \ldots, X_n\} \setminus \{X_i\}$ to be considered as parent sets. This gives a total space
and time of $O(n \cdot 2^n \cdot v)$ to build the cache, where $v$ is the worst-case asymptotic time to
compute the local score function at each node.$^3$ Instead, we describe a collection of results
that are used to obtain much smaller caches in many practical cases.

First,  Lemma  1  is  quite  simple  but  very  useful  to  discard  elements  from  the  cache  of 
each  node  Xi.  It  holds  for  all  score  functions  that  we  treat  in  this  paper.  It  was  previously 
stated  in  Teyssier  and  Koller  (2005)  and  de  Campos  et  al.  (2009),  among  others. 


Lemma 1 Let $X_i$ be a node of $\mathcal{G}'$, a candidate DAG for a Bayesian network where the
parent set of $X_i$ is $\Pi'$. Suppose $\Pi_i \subset \Pi'$ is such that $s_i(\Pi_i) > s_i(\Pi')$ (where
$s$ is one of BIC, AIC, BD or derived criteria). Then $\Pi'$ is not the parent set of $X_i$ in an
optimal DAG $\mathcal{G}^*$.

Proof This fact follows directly from the decomposability of the score functions.
Take a graph $\mathcal{G}$ that differs from $\mathcal{G}'$ only on the parent set of $X_i$, where it has
$\Pi_i$ instead of $\Pi'$. Note that $\mathcal{G}$ is also a DAG (as $\mathcal{G}$ is a subgraph of
$\mathcal{G}'$ built from the removal of some arcs, which cannot create cycles) and
$s(\mathcal{G}) = \sum_{j \neq i} s_j(\Pi'_j) + s_i(\Pi_i) > \sum_{j \neq i} s_j(\Pi'_j) + s_i(\Pi') = s(\mathcal{G}')$.
Any DAG $\mathcal{G}'$ with parent set $\Pi'$ for $X_i$ has a subgraph $\mathcal{G}$ with a better score
than that of $\mathcal{G}'$, and thus $\Pi'$ is not the optimal parent configuration for $X_i$ in
$\mathcal{G}^*$. ■


Unfortunately, Lemma 1 does not tell us anything about supersets of $\Pi'$, that is, we still
need to compute scores for all the possible parent sets and later verify which of them can
be removed. This would still leave us with $n \cdot 2^n \cdot v$ asymptotic time and space requirements
(although the space would be reduced after applying the lemma). The next two subsections
present results to avoid all such computations. BIC and AIC are treated separately from
BD and derivatives (reasons for that will become clear in the derivations).
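
As an illustration of how Lemma 1 is applied once the scores are available, the sketch below filters the cache of a single node, dropping every parent set that has a proper subset with a strictly higher score. The dictionary layout and the score values are hypothetical, and the quadratic scan is chosen for clarity rather than efficiency.

```python
def prune_cache(cache):
    """Keep only parent sets not dominated by any proper subset (Lemma 1).

    cache maps frozenset(parent set) -> local score for one fixed node X_i.
    """
    kept = {}
    for parent_set, score in cache.items():
        dominated = any(
            subset < parent_set and subset_score > score  # '<' is proper subset
            for subset, subset_score in cache.items()
        )
        if not dominated:
            kept[parent_set] = score
    return kept


# Hypothetical local scores for a node with candidate parents A and B.
cache = {
    frozenset(): -10.0,
    frozenset({"A"}): -8.0,
    frozenset({"B"}): -12.0,
    frozenset({"A", "B"}): -9.0,
}
pruned = prune_cache(cache)
```

In this toy cache the sets {B} and {A, B} are discarded, because the empty set and {A} respectively score higher. Note that the filter still requires every score to have been computed first, which is exactly the cost that the results of the next subsections aim to avoid.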

4.1  BIC  and  AIC  score  properties 

The next theorems address the issue of having to compute scores for all possible parent sets
when one is using the BIC or AIC criteria. BD scores are dealt with later on.

Theorem 2 Using BIC or AIC as score function, suppose that $X_i$, $\Pi_i$ are such that
$r_{\Pi_i} > \frac{N}{w} \cdot \frac{\log r_i}{r_i - 1}$. If $\Pi'$ is a proper superset of $\Pi_i$, then
$\Pi'$ is not the parent set of $X_i$ in an optimal structure.

3.  Note  that  the  time  to  compute  a  single  local  score  might  be  large  depending  on  the  number  of  parents 
but  still  asymptotically  bounded  by  the  data  set  size. 
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Proof  4  We  know  that  II'  contains  at  least  one  additional  node,  that  is,  II'  D  11*  U  { Xe } 
and  Xe  ^  II*.  Because  II*  C  II',  L*(n')  is  certainly  greater  than  or  equal  to  L*(II*),  and 
t*(n')  will  certainly  be  greater  than  the  corresponding  value  i*(n*)  in  Q.  The  difference  in 
the  scores  is  s,(II')  —  s*(H*),  which  equals  to  (see  the  explanations  after  the  formulas): 


maxLi(Il')  -  f*(Il')  -  (maxL*(II*)  -  t*(H*))  < 

e'i  0i 

-  max  L*(II*)  -  t*(n')  +  t*(n*)  = 


Oi 

rni  1 

/  n 

y  nu 

3= 1  ' 

log 

rih 

HJ 


-  ti(n')  +  f*(n*)  < 

3= 1 


r  n,- 


y,  riijlogri  -  rn,  •  (re  -  1)  •  (r*  -  1)  •  w  < 

3= 1 


rni 

y  n*jlogr*  -  rn,  •  (r*  -  1)  •  w  =  Mogr*  -  rni  •  (r*  -  1)  •  re. 
j=i 


The  first  step  uses  the  fact  that  L*(II')  is  negative,  so  we  drop  it,  the  second  step  uses  the 
fact  that  6*jk  =  -^rr,  with  n*j  =  YH=  l  nijk,  the  third  step  uses  the  definition  of  entropy  H(-) 
of  a  discrete  distribution,  and  the  fourth  step  uses  the  fact  that  the  entropy  of  a  discrete 
distribution  is  less  than  the  log  of  its  number  of  categories.  Finally,  the  last  equation  is 
negative  if  7’ip  •  (r*  —  1)  •  w  >  IVlogr*,  which  is  exactly  the  hypothesis  of  the  theorem.  Hence 
si(n')  <  s*(n*),  and  Lemma  1  guarantees  that  n'  cannot  be  the  parent  set  of  X*  in  an 
optimal  structure.  ■ 


Corollary 3 Using BIC or AIC as criterion, the optimal graph $\mathcal{G}$ has at most $O(\log N)$ parents per node.

Proof Assuming $N > 4$, we have $\frac{\log r_i}{w (r_i - 1)} < 1$ (because $w$ is either $1$ or $\frac{\log N}{2}$). Take a variable $X_i$ and a parent set $\Pi_i$ with exactly $\lceil \log_2 N \rceil$ elements. Because every variable has at least two states, we know that $r_{\Pi_i} \ge 2^{|\Pi_i|} \ge N > \frac{N}{w} \frac{\log r_i}{r_i - 1}$, and by Theorem 2 we know that no proper superset of $\Pi_i$ can be an optimal parent set. ■

Theorem 2 and Corollary 3 ensure that the cache stores at most $O\binom{n-1}{\lceil \log_2 N \rceil}$ elements for each variable (all combinations up to $\lceil \log_2 N \rceil$ parents). The next lemma does not help us to improve the theoretical size bound that is achieved by Corollary 3, but it is quite useful in practice because it is applicable even in cases where Theorem 2 is not, implying that fewer parent sets need to be inspected.
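To make the size of these bounds concrete, the sketch below evaluates the hypothesis of Theorem 2 and counts the parent sets with at most $\lceil \log_2 N \rceil$ elements. The function names are hypothetical, and the weight $w$ is taken to be $\log N / 2$ for BIC and $1$ for AIC, as used in the proof of Corollary 3.

```python
import math


def theorem2_applies(r_parents, r_i, n_samples, criterion="BIC"):
    """True if r_Pi > (N / w) * log(r_i) / (r_i - 1), the hypothesis of Theorem 2.

    r_parents is the number of joint configurations of the parent set.
    """
    w = math.log(n_samples) / 2 if criterion == "BIC" else 1.0
    return r_parents > (n_samples / w) * math.log(r_i) / (r_i - 1)


def max_parents_bound(n_samples):
    """Parent-set size from Corollary 3 beyond which supersets are never optimal."""
    return math.ceil(math.log2(n_samples))


def cache_size_bound(n_vars, n_samples):
    """Number of parent sets of one node with at most ceil(log2 N) elements."""
    k_max = min(max_parents_bound(n_samples), n_vars - 1)
    return sum(math.comb(n_vars - 1, k) for k in range(k_max + 1))
```

For instance, with 30 variables and 1000 samples the count stays well below the $2^{29}$ parent sets that a node has without any pruning.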

4. Another similar proof appears in Bouckaert (1994), but it leads directly to the conclusion of Corollary 3. The intermediate result is algorithmically important.

Theorem 4 Let BIC or AIC be the score criterion and let $X_i$ be a node with $\Pi_i \subset \Pi_i'$ two possible parent sets such that $t_i(\Pi_i') + s_i(\Pi_i) > 0$. Then $\Pi_i'$ and all supersets $\Pi_i'' \supset \Pi_i'$ are not optimal parent configurations for $X_i$.

Proof We have that $t_i(\Pi_i') + s_i(\Pi_i) > 0 \Rightarrow -t_i(\Pi_i') - s_i(\Pi_i) < 0$, and because $L_i(\cdot)$ is a negative function, it implies

$$ (L_i(\Pi_i') - t_i(\Pi_i')) - s_i(\Pi_i) < 0 \Rightarrow s_i(\Pi_i') < s_i(\Pi_i). $$

Using Lemma 1, we have that $\Pi_i'$ is not the optimal parent set for $X_i$. The result also follows for any $\Pi_i'' \supset \Pi_i'$, as we know that $t_i(\Pi_i'') > t_i(\Pi_i')$ and the same argument suffices. ■


Theorem 4 provides a bound to discard parent sets without even inspecting them. The idea is to verify the assumptions of Theorem 4 every time the score of a parent set $\Pi_i$ of $X_i$ is about to be computed, by taking the best score of any subset and testing it against the theorem. Only subsets that have been checked against the structural constraints can be used, that is, a subset with high score that violates constraints cannot be used as the "certificate" to discard its supersets (in fact, it is not a valid parent set in the first place). This ensures that the results are valid even in the presence of constraints. Whenever the theorem can be applied, $\Pi_i$ is discarded and its supersets are not even inspected. This result allows us to stop computing scores earlier than in the worst case, reducing the number of computations needed to build and store the cache. $\Pi_i$ is also checked against Lemma 1 (which is stronger in the sense that, instead of a bounding function, the actual scores are directly compared). However, Lemma 1 cannot help us to avoid analyzing the supersets of $\Pi_i$.
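The procedure just described can be sketched as a small cache builder for one node. This is a minimal sketch under stated assumptions, not the authors' implementation: parent sets are visited by increasing size, the penalty is assumed to be $t_i(\Pi_i) = w \cdot (r_i - 1) \cdot r_{\Pi_i}$ with $w = \log N / 2$ (BIC), no structural constraints are present, and the names `penalty`, `bic_local` and `build_cache` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def penalty(i, parents, arity, w):
    """t_i(parents), assumed to be w * (r_i - 1) * product of parent arities."""
    return w * (arity[i] - 1) * math.prod(arity[p] for p in parents)


def bic_local(data, i, parents, arity):
    """Local BIC score: maximised log-likelihood minus the penalty t_i."""
    order = sorted(parents)
    n_ijk = Counter((tuple(row[p] for p in order), row[i]) for row in data)
    n_ij = Counter()
    for (j, _), c in n_ijk.items():
        n_ij[j] += c
    loglik = sum(c * math.log(c / n_ij[j]) for (j, _), c in n_ijk.items())
    return loglik - penalty(i, parents, arity, math.log(len(data)) / 2)


def build_cache(data, i, arity):
    """Score parent sets of X_i by increasing size, skipping via Theorem 4."""
    others = [v for v in range(len(arity)) if v != i]
    w = math.log(len(data)) / 2
    cache, discarded, evaluated = {}, set(), 0
    for size in range(len(others) + 1):
        for combo in combinations(others, size):
            ps = frozenset(combo)
            if any(ps - {p} in discarded for p in ps):
                discarded.add(ps)  # superset of a set that was already cut
                continue
            # Best cached score among proper subsets (the "certificate").
            best = max((s for sub, s in cache.items() if sub < ps),
                       default=-math.inf)
            if penalty(i, ps, arity, w) + best > 0:
                discarded.add(ps)  # Theorem 4: the score is never computed
                continue
            score = bic_local(data, i, ps, arity)
            evaluated += 1
            if score >= best:      # Lemma 1 drops sets beaten by a subset
                cache[ps] = score
    return cache, evaluated
```

The second return value counts how many local scores were actually computed, which can be compared with the $2^{n-1}$ evaluations of an exhaustive pass.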

4.2  BD  score  properties 

First note that the BD scores can be rewritten as:

$$ s_i(\Pi_i) = \sum_{j \in J_i} \left( \log \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} + \sum_{k \in K_{ij}} \log \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} \right), $$

where $J_i = J_i^{\Pi_i} = \{1 \le j \le r_{\Pi_i} : n_{ij} \neq 0\}$, because $n_{ij} = 0$ implies that all terms cancel each other. In the same manner, $n_{ijk} = 0$ implies that the terms of the internal summation cancel out, so let $K_{ij} = K_{ij}^{\Pi_i} = \{1 \le k \le r_i : n_{ijk} \neq 0\}$ be the indices of the categories of $X_i$ such that $n_{ijk} \neq 0$. Let $K_i^{\Pi_i} = \cup_j K_{ij}^{\Pi_i}$ be a vector with all indices corresponding to non-zero counts for $\Pi_i$ (note that the symbol $\cup$ must be seen as a concatenation of vectors, as we allow $K_i^{\Pi_i}$ to have repetitions). The counts $n_{ijk}$ (and consequently $n_{ij} = \sum_k n_{ijk}$) are completely defined if we know the parent set $\Pi_i$. Rewrite the score as follows:

$$ s_i(\Pi_i) = \sum_{j \in J_i} \Big( f(K_{ij}, (\alpha_{ijk})_{\forall k}) + g((n_{ijk})_{\forall k}, (\alpha_{ijk})_{\forall k}) \Big), $$

with

$$ f(K_{ij}, (\alpha_{ijk})_{\forall k}) = \log \Gamma(\alpha_{ij}) - \sum_{k \in K_{ij}} \log \Gamma(\alpha_{ijk}), $$

$$ g((n_{ijk})_{\forall k}, (\alpha_{ijk})_{\forall k}) = -\log \Gamma(\alpha_{ij} + n_{ij}) + \sum_{k \in K_{ij}} \log \Gamma(\alpha_{ijk} + n_{ijk}). $$

We do not need $K_{ij}$ as argument of $g(\cdot)$ because the set of non-zero $n_{ijk}$ is known from the counts $(n_{ijk})_{\forall k}$ that are already available as arguments of $g(\cdot)$. To achieve the desired theorem that will be able to reduce the computational time to build the cache, some intermediate results are necessary.
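The equivalence between the usual form of the BD score and the $f + g$ decomposition restricted to non-zero counts can be checked numerically. The sketch below is an illustration only; it assumes $\alpha_{ij} = \sum_k \alpha_{ijk}$, represents counts and hyper-parameters as nested lists indexed by $[j][k]$, and uses hypothetical function names.

```python
from math import lgamma


def bd_direct(counts, alpha):
    """Local BD score summed over every configuration j and category k."""
    total = 0.0
    for n_j, a_j in zip(counts, alpha):
        total += lgamma(sum(a_j)) - lgamma(sum(a_j) + sum(n_j))
        total += sum(lgamma(a + n) - lgamma(a) for n, a in zip(n_j, a_j))
    return total


def f_term(n_j, a_j):
    """f(K_ij, alpha): uses the counts only to know which ones are non-zero."""
    return lgamma(sum(a_j)) - sum(lgamma(a) for n, a in zip(n_j, a_j) if n > 0)


def g_term(n_j, a_j):
    """g(counts, alpha): restricted to the categories with non-zero counts."""
    return (-lgamma(sum(a_j) + sum(n_j))
            + sum(lgamma(a + n) for n, a in zip(n_j, a_j) if n > 0))


def bd_decomposed(counts, alpha):
    """Sum of f + g over the configurations j in J_i (those with n_ij != 0)."""
    return sum(f_term(n_j, a_j) + g_term(n_j, a_j)
               for n_j, a_j in zip(counts, alpha) if sum(n_j) > 0)


# Hypothetical counts: 3 parent configurations, 3 categories, BDeu with ESS = 1.
counts = [[3, 0, 1], [0, 0, 0], [0, 5, 0]]
alpha = [[1.0 / 9] * 3 for _ in range(3)]
difference = bd_direct(counts, alpha) - bd_decomposed(counts, alpha)
```

The empty configuration and the zero counts contribute nothing to either sum, which is why `difference` vanishes up to rounding.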

Lemma 5 Let $\Pi_i$ be the parent set of $X_i$, $(\alpha_{ijk})_{\forall ijk} > 0$ be the hyper-parameters, and integers $(n_{ijk})_{\forall ijk} \ge 0$ be counts obtained from data. We have that $g((n_{ijk})_{\forall k}, (\alpha_{ijk})_{\forall k}) \le -\log \Gamma(v) \approx 0.1214$ if $n_{ij} \ge 1$, where $v = \arg\max_{x > 0} -\log \Gamma(x) \approx 1.4616$. Furthermore, $g((n_{ijk})_{\forall k}, (\alpha_{ijk})_{\forall k}) \le -\log \alpha_{ij} + \log \alpha_{ijk} - f(K_{ij}, (\alpha_{ijk})_{\forall k})$ if $|K_{ij}| = 1$.


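Proof We use the relation $\Gamma(x + \sum_k a_k) \ge \Gamma(x + 1) \prod_k \Gamma(a_k)$, for $x \ge 0$, $\forall k \; a_k \ge 1$ and $\sum_k a_k \ge 1$ (note that it is valid even if there is a single element in the summation). This relation comes from the Beta function inequality:

$$ \frac{\Gamma(x)\Gamma(y)}{\Gamma(x + y)} \le \frac{x + y}{xy} \;\Rightarrow\; \Gamma(x + 1)\Gamma(y + 1) \le \Gamma(x + y + 1), $$

where $x, y > 0$. Applying the transformation $y + 1 = \sum_t a_t$ (which is possible because $\sum_t a_t > 1$ and thus $y > 0$), we obtain:

$$ \Gamma\Big(x + \sum_t a_t\Big) \ge \Gamma(x + 1)\,\Gamma\Big(\sum_t a_t\Big) \ge \Gamma(x + 1) \prod_t \Gamma(a_t), \qquad (8) $$

(the last step is due to $a_t \ge 1$ for all $t$, so the same relation of the Beta function can be applied throughout, because $\Gamma(x + 1)\Gamma(y + 1) \le \Gamma(x + y + 1) \le \Gamma(x + 1 + y + 1)$).

With the relation just derived in hand, we have

$$ \frac{\Gamma(\alpha_{ij} + n_{ij})}{\prod_{k \in K_{ij}} \Gamma(\alpha_{ijk} + n_{ijk})} = \frac{\Gamma\big(\sum_{1 \le k \le r_i} (\alpha_{ijk} + n_{ijk})\big)}{\prod_{k \in K_{ij}} \Gamma(\alpha_{ijk} + n_{ijk})} = \frac{\Gamma\big(\sum_{k \notin K_{ij}} \alpha_{ijk} + \sum_{k \in K_{ij}} (\alpha_{ijk} + n_{ijk})\big)}{\prod_{k \in K_{ij}} \Gamma(\alpha_{ijk} + n_{ijk})} \ge \Gamma\Big(1 + \sum_{k \notin K_{ij}} \alpha_{ijk}\Big), $$

obtained by renaming $x = \sum_{k \notin K_{ij}} \alpha_{ijk}$ and $a_k = \alpha_{ijk} + n_{ijk}$ (we have that $\sum_{k \in K_{ij}} (\alpha_{ijk} + n_{ijk}) \ge n_{ij} \ge 1$ and each $a_k \ge 1$). Thus

$$ g((n_{ijk})_{\forall k}, (\alpha_{ijk})_{\forall k}) = -\log \frac{\Gamma(\alpha_{ij} + n_{ij})}{\prod_{k \in K_{ij}} \Gamma(\alpha_{ijk} + n_{ijk})} \le -\log \Gamma\Big(1 + \sum_{k \notin K_{ij}} \alpha_{ijk}\Big). $$

Because $v = \arg\max_{x > 0} -\log \Gamma(x)$, we have $-\log \Gamma(1 + \sum_{k \notin K_{ij}} \alpha_{ijk}) \le -\log \Gamma(v)$.

Now, the second part of the lemma. If $|K_{ij}| = 1$, then let $K_{ij} = \{k\}$. We know that $n_{ij} \ge 1$ and thus

$$
\begin{aligned}
g((n_{ijk})_{\forall k}, (\alpha_{ijk})_{\forall k}) &= -\log \frac{\Gamma(\alpha_{ij} + n_{ij})}{\Gamma(\alpha_{ijk} + n_{ij})} = -\log \frac{\Gamma(\alpha_{ij}) \prod_{t=0}^{n_{ij} - 1} (\alpha_{ij} + t)}{\Gamma(\alpha_{ijk}) \prod_{t=0}^{n_{ij} - 1} (\alpha_{ijk} + t)} \\
&= -f(K_{ij}, (\alpha_{ijk})_{\forall k}) - \log \frac{\alpha_{ij}}{\alpha_{ijk}} - \sum_{t=1}^{n_{ij} - 1} \log \frac{\alpha_{ij} + t}{\alpha_{ijk} + t} \\
&\le -\log \alpha_{ij} + \log \alpha_{ijk} - f(K_{ij}, (\alpha_{ijk})_{\forall k}),
\end{aligned}
$$

because $\frac{\alpha_{ij} + t}{\alpha_{ijk} + t} \ge 1$ for every $t$. ■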


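Lemma 6 Let $\Pi_i$ be the parent set of $X_i$, $(\alpha_{ijk})_{\forall ijk} > 0$ be the hyper-parameters, and integers $(n_{ijk})_{\forall ijk} \ge 0$ be counts obtained from data. We have that $g((n_{ijk})_{\forall k}, (\alpha_{ijk})_{\forall k}) \le 0$ if $n_{ij} \ge 2$.

Proof If $n_{ij} \ge 2$, we use the relation $\Gamma(x + \sum_k a_k) \ge \Gamma(x + 2) \prod_k \Gamma(a_k)$, for $x \ge 0$, $\forall k \; a_k \ge 1$ and $\sum_k a_k \ge 2$. This inequality is obtained in the same way as in Lemma 5, but using a tighter Beta function bound:

$$ \frac{\Gamma(x)\Gamma(y)}{\Gamma(x + y)} \le \frac{x + y}{xy} \left( \frac{(x + 1)(y + 1)}{x + y + 1} \right)^{-1} \;\Rightarrow\; \Gamma(x + 2)\Gamma(y + 2) \le \Gamma(x + y + 2), $$

and the relation follows by using $y + 2 = \sum_t a_t$ and the same derivation as before. Now,

$$ \frac{\Gamma(\alpha_{ij} + n_{ij})}{\prod_{k \in K_{ij}} \Gamma(\alpha_{ijk} + n_{ijk})} = \frac{\Gamma\big(\sum_{1 \le k \le r_i} (\alpha_{ijk} + n_{ijk})\big)}{\prod_{k \in K_{ij}} \Gamma(\alpha_{ijk} + n_{ijk})} = \frac{\Gamma\big(\sum_{k \notin K_{ij}} \alpha_{ijk} + \sum_{k \in K_{ij}} (\alpha_{ijk} + n_{ijk})\big)}{\prod_{k \in K_{ij}} \Gamma(\alpha_{ijk} + n_{ijk})} \ge \Gamma\Big(2 + \sum_{k \notin K_{ij}} \alpha_{ijk}\Big), $$

obtained by renaming $x = \sum_{k \notin K_{ij}} \alpha_{ijk}$ and $a_k = \alpha_{ijk} + n_{ijk}$, as we know that $\sum_{k \in K_{ij}} (\alpha_{ijk} + n_{ijk}) \ge n_{ij} \ge 2$ and each $a_k \ge 1$. Finally,

$$ g((n_{ijk})_{\forall k}, (\alpha_{ijk})_{\forall k}) = -\log \frac{\Gamma(\alpha_{ij} + n_{ij})}{\prod_{k \in K_{ij}} \Gamma(\alpha_{ijk} + n_{ijk})} \le -\log \Gamma\Big(2 + \sum_{k \notin K_{ij}} \alpha_{ijk}\Big) \le 0, $$

because $\Gamma(2 + \sum_{k \notin K_{ij}} \alpha_{ijk}) \ge 1$. ■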


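Lemma 7 Given a BD score and two parent sets $\Pi_i^0$ and $\Pi_i$ for a node $X_i$ such that $\Pi_i^0 \subset \Pi_i$, if

$$ s_i(\Pi_i^0) > \sum_{j \in J_i^{\Pi_i} :\, |K_{ij}^{\Pi_i}| \ge 2} f(K_{ij}^{\Pi_i}, (\alpha_{ijk}^{\Pi_i})_{\forall k}) + \sum_{j \in J_i^{\Pi_i} :\, |K_{ij}^{\Pi_i}| = 1} \log \frac{\alpha_{ijk'}^{\Pi_i}}{\alpha_{ij}^{\Pi_i}}, \qquad (9) $$

where $k'$ denotes the single element of $K_{ij}^{\Pi_i}$ in the second sum, then $\Pi_i$ is not the optimal parent set for $X_i$.

Proof Using the results of Lemmas 5 and 6 (all quantities below refer to $\Pi_i$),

$$
\begin{aligned}
s_i(\Pi_i) &= \sum_{j \in J_i} \Big( f(K_{ij}, (\alpha_{ijk})_{\forall k}) + g((n_{ijk})_{\forall k}, (\alpha_{ijk})_{\forall k}) \Big) \\
&= \sum_{j \in J_i :\, |K_{ij}| \ge 2} \Big( f(K_{ij}, (\alpha_{ijk})_{\forall k}) + g((n_{ijk})_{\forall k}, (\alpha_{ijk})_{\forall k}) \Big) + \sum_{j \in J_i :\, |K_{ij}| = 1} \Big( f(K_{ij}, (\alpha_{ijk})_{\forall k}) + g((n_{ijk})_{\forall k}, (\alpha_{ijk})_{\forall k}) \Big) \\
&\le \sum_{j \in J_i :\, |K_{ij}| \ge 2} f(K_{ij}, (\alpha_{ijk})_{\forall k}) + \sum_{j \in J_i :\, |K_{ij}| = 1} \big( -\log \alpha_{ij} + \log \alpha_{ijk'} \big) \\
&= \sum_{j \in J_i :\, |K_{ij}| \ge 2} f(K_{ij}, (\alpha_{ijk})_{\forall k}) + \sum_{j \in J_i :\, |K_{ij}| = 1} \log \frac{\alpha_{ijk'}}{\alpha_{ij}},
\end{aligned}
$$

which, by the assumption of this lemma, is less than $s_i(\Pi_i^0)$. Thus, we conclude that the parent set $\Pi_i^0$ has better score than $\Pi_i$, and the desired result follows from Lemma 1. ■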


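Lemma 8 Given the BDeu score, $(\alpha_{ijk})_{\forall ijk} > 0$, and integers $(n_{ijk})_{\forall ijk} \ge 0$ such that $\alpha_{ij} \le 0.8349$ and $|K_{ij}| \ge 2$ for a given $j$, then $f(K_{ij}, (\alpha_{ijk})_{\forall k}) \le -|K_{ij}| \cdot \log r_i$.

Proof Using $\alpha_{ijk} = \frac{\alpha_{ij}}{r_i} \le 0.8349$ (for all $k$), we have

$$
\begin{aligned}
f(K_{ij}, (\alpha_{ijk})_{\forall k}) &= \log \Gamma(\alpha_{ij}) - |K_{ij}| \log \Gamma\Big(\frac{\alpha_{ij}}{r_i}\Big) \\
&= \log \Gamma(\alpha_{ij}) - |K_{ij}| \log \Gamma\Big(\frac{\alpha_{ij}}{r_i} + 1\Big) + |K_{ij}| \log \frac{\alpha_{ij}}{r_i} \\
&= \log \Gamma(\alpha_{ij}) - |K_{ij}| \log \frac{\Gamma\big(\frac{\alpha_{ij}}{r_i} + 1\big)}{\alpha_{ij}} - |K_{ij}| \log r_i \\
&= |K_{ij}| \log \frac{\Gamma(\alpha_{ij})^{1/|K_{ij}|} \, \alpha_{ij}}{\Gamma\big(\frac{\alpha_{ij}}{r_i} + 1\big)} - |K_{ij}| \log r_i.
\end{aligned}
$$

Now, $\Gamma(\alpha_{ij})^{1/|K_{ij}|} \, \alpha_{ij} \le \Gamma(\frac{\alpha_{ij}}{r_i} + 1)$, because $r_i \ge 2$, $|K_{ij}| \ge 2$ and $\alpha_{ij} \le 0.8349$ (this number can be computed by numerically solving the inequality for $r_i = |K_{ij}| = 2$). We point out that 0.8349 is a bound for $\alpha_{ij}$ that ensures this last inequality holds when $r_i = |K_{ij}| = 2$, which is the worst-case scenario (greater values of $r_i$ and $|K_{ij}|$ make the left-hand side decrease and the right-hand side increase). Because $r_i$ of each node is known, tighter bounds might be possible according to the node. ■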
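The constant 0.8349 can be recovered numerically, as the proof suggests, by solving $\Gamma(\alpha_{ij})^{1/|K_{ij}|} \alpha_{ij} = \Gamma(\alpha_{ij}/r_i + 1)$ for $r_i = |K_{ij}| = 2$. The following sketch does so by bisection on the logarithm of both sides; the bracketing interval $[0.5, 1]$ and the function names are choices made for this illustration.

```python
from math import lgamma, log


def gap(alpha_ij, r_i=2, k_size=2):
    """log(LHS) - log(RHS) of Gamma(a)^(1/|K|) * a <= Gamma(a / r_i + 1)."""
    return lgamma(alpha_ij) / k_size + log(alpha_ij) - lgamma(alpha_ij / r_i + 1)


def threshold(lo=0.5, hi=1.0, iters=60):
    """Largest alpha_ij in [lo, hi] for which the inequality still holds."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if gap(mid) <= 0:   # inequality holds: move the lower end up
            lo = mid
        else:               # inequality violated: move the upper end down
            hi = mid
    return lo
```

Calling `gap` with larger values of `r_i` and `k_size` at the same `alpha_ij` gives a more negative result, in line with the remark that node-specific bounds could be looser than 0.8349.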


Theorem 9 Given the BDeu score and two parent sets $\Pi_i^0$ and $\Pi_i$ for a node $X_i$ such that $\Pi_i^0 \subset \Pi_i$ and $\alpha_{ij}^{\Pi_i} \le 0.8349$ for every $j$, if $s_i(\Pi_i^0) > -|K_i^{\Pi_i}| \log r_i$ then neither $\Pi_i$ nor any superset $\Pi_i' \supset \Pi_i$ are optimal parent sets for $X_i$.

Proof We have that

$$ s_i(\Pi_i^0) > -|K_i^{\Pi_i}| \log r_i = \sum_{j \in J_i^{\Pi_i} :\, |K_{ij}^{\Pi_i}| \ge 2} -|K_{ij}^{\Pi_i}| \log r_i + \sum_{j \in J_i^{\Pi_i} :\, |K_{ij}^{\Pi_i}| = 1} -\log r_i, $$

which by Lemma 8 is greater than or equal to

$$ \sum_{j \in J_i^{\Pi_i} :\, |K_{ij}^{\Pi_i}| \ge 2} f(K_{ij}^{\Pi_i}, (\alpha_{ijk}^{\Pi_i})_{\forall k}) + \sum_{j \in J_i^{\Pi_i} :\, |K_{ij}^{\Pi_i}| = 1} -\log r_i. $$

Now, Lemma 7 suffices to show that $\Pi_i$ is not an optimal parent set, because $-\log r_i = \log \frac{\alpha_{ijk}^{\Pi_i}}{\alpha_{ij}^{\Pi_i}}$ for any $k$. To show the result for any superset $\Pi_i' \supset \Pi_i$, we just have to note that $|K_i^{\Pi_i'}| \ge |K_i^{\Pi_i}|$ (because the overall number of non-zero counts can only increase when we include more parents), and $\alpha_{ij'}^{\Pi_i'}$ (for all $j'$) are all less than 0.8349 (because the $\alpha$s can only decrease when more parents are included), thus we can apply the very same reasoning to all supersets. ■

Theorem 9 provides a bound to discard parent sets without even inspecting them because of the non-increasing monotonicity of the employed bounding function when we increase the number of parents. As done for the BIC and AIC criteria, the idea is to check the validity of Theorem 9 every time the score of a parent set $\Pi_i$ of $X_i$ is about to be computed, by taking the best score of any subset and testing it against the theorem (of course using only subsets that satisfy the structural constraints). Whenever possible, we discard $\Pi_i$ and do not even look into its supersets. Note that the assertion $\alpha_{ij} \le 0.8349$ required by the theorem is not too restrictive because, as parent sets grow, the ESS is divided by larger numbers (it is an exponential decrease of the $\alpha$s). Hence, the values $\alpha_{ij}$ quickly fall below such a threshold. Furthermore, $\Pi_i$ is also checked against Lemma 1 (although it does not help with the supersets). As we see later in the experiments, the practical size of the cache after the application of the properties is small even for considerably large networks, and both Lemma 1 and Theorem 9 help reduce the cache size, while Theorem 9 also helps to reduce computations. Finally, we point out that Singh and Moore (2005) have already worked on bounds to reduce the number of parent sets that need to be inspected, but Theorem 9 provides a much tighter bound than their previous result, where the cut happens only after all $|K_{ij}^{\Pi_i}|$ go below two (or, using previous terminology, when configurations are pure).
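The test of Theorem 9 only needs the number of non-zero counts of the candidate parent set, not its score. A minimal sketch follows, assuming the counts are given as a full table with one row per parent configuration (so that $r_{\Pi_i}$ is the number of rows) and that BDeu uses $\alpha_{ij} = \mathrm{ESS} / r_{\Pi_i}$; the function names are hypothetical, and `bdeu_local` is included only to illustrate that the bound really lies above the score.

```python
import math
from math import lgamma

ALPHA_MAX = 0.8349  # threshold required by Theorem 9


def bdeu_local(counts, ess):
    """Local BDeu score for a table counts[j][k] (j: parent config, k: category)."""
    r_pi, r_i = len(counts), len(counts[0])
    a_ij, a_ijk = ess / r_pi, ess / (r_pi * r_i)
    score = 0.0
    for row in counts:
        score += lgamma(a_ij) - lgamma(a_ij + sum(row))
        score += sum(lgamma(a_ijk + n) - lgamma(a_ijk) for n in row)
    return score


def theorem9_discards(counts, ess, best_subset_score):
    """True if the candidate (and its supersets) can be skipped by Theorem 9."""
    r_pi, r_i = len(counts), len(counts[0])
    if ess / r_pi > ALPHA_MAX:
        return False  # the theorem does not apply for this alpha_ij
    nonzero = sum(1 for row in counts for n in row if n > 0)  # |K_i|
    return best_subset_score > -nonzero * math.log(r_i)
```

Here `best_subset_score` stands for the best score among the already cached subsets of the candidate that satisfy the structural constraints.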

5.  Constrained  B&B  algorithm 

In this section we describe the branch-and-bound (B&B) algorithm used to find the best structure of the Bayesian network and comment on its complexity and correctness. The algorithm uses a B&B search where each case to be solved is a relaxation of a DAG, that is, the cases may contain cycles. At each step, a graph is picked from a priority queue and checked for acyclicity. If it is a DAG, it is a feasible structure for the network and we compare its score against the best score so far (which is updated if needed). Otherwise, there must be a directed cycle in the graph, which is then broken into subcases by forcing some arcs to be absent/present. Each subcase is put in the queue to be processed (these subcases cover all possible subgraphs related to the original case, that is, they cover all possible ways to break the cycle). The procedure stops when the queue is empty. Note that every time we break a cycle, the subcases that are created are independent, that is, their sets of graphs are disjoint. We obtain this fact by properly breaking the cycles to avoid overlapping among subcases (more details below). This is the same idea as in the inclusion-exclusion principle of combinatorics, employed over the set of arcs that formed the cycle. It ensures that we never process the same graph twice and that all subgraphs are covered.

The  initialization  of  the  algorithm  is  as  follows: 

• $C : (X_i, \Pi_i) \to \mathbb{R}$ is the cache with the scores for all the variables and their possible parent configurations. This is constructed using a queue and analyzing parent sets according to the properties of Section 4, which saves (in practice) a large amount of space and time. All the structural constraints are considered in this construction so that only valid parent sets are stored.

• $\mathcal{G}$ is the graph created by taking the best parent configuration for each node without checking for acyclicity (so it is not necessarily a DAG), and $s$ is the score of $\mathcal{G}$. This graph is used as an upper bound to the best possible graph, as it is clearly obtained from a relaxation of the problem (the relaxation comes from allowing cycles).

• $\mathcal{H}$ is an initially empty matrix containing, for each possible arc between nodes, a mark stating that the arc must be present, or is prohibited, or is free (may be present or not). This matrix controls the search of the B&B procedure. Each branch of the search has an $\mathcal{H}$ that specifies the graphs that still must be searched within that branch.

• $Q$ is a priority queue of triples $(\mathcal{G}, \mathcal{H}, s)$, ordered by $s$ (initially it contains a single triple with $\mathcal{G}$, $\mathcal{H}$ and $s$ as mentioned). The order is such that the peak always contains the triple of greatest $s$.

• $(\mathcal{G}_{best}, s_{best})$ keeps at any moment the best DAG and score found so far ($s_{best}$ is initialized with $-\infty$). In fact, this best solution can be initialized using any inner approximation method. For instance, we use a procedure that guesses an ordering for the variables, then computes the global best solution for that ordering, and finally runs a hill climbing over the resulting structure. All these procedures are very fast (given the small size of the pre-computed cache that we obtain in the previous steps). A good initial solution may significantly reduce the search of the B&B procedure, because it may give a lower bound closer to the upper bound defined by the relaxation $(\mathcal{G}, \mathcal{H}, s)$.

The  main  loop  of  the  B&B  search  is  as  follows: 

• While $Q$ is not empty, do

1. Remove the peak $(\mathcal{G}_{cur}, \mathcal{H}_{cur}, s_{cur})$ of $Q$. If $s_{cur} < s_{best}$ (worse than an already known solution), then discard it and start the loop again.

2. If $\mathcal{G}_{cur}$ is a DAG (it certainly satisfies all structural constraints, because all the elements in the cache do so), update $(\mathcal{G}_{best}, s_{best})$ with $(\mathcal{G}_{cur}, s_{cur})$ and start the loop again.

3. Take a cycle of $\mathcal{G}_{cur}$ (one must exist, otherwise we would not have reached this step), namely $v = (X_{a_1} \to X_{a_2} \to \ldots \to X_{a_{q+1}})$, with $a_1 = a_{q+1}$. This can be computed by a single search in the graph.

4. For $y = 1, \ldots, q$, do

– Mark on $\mathcal{H}_{cur}$ that the arc $X_{a_y} \to X_{a_{y+1}}$ is prohibited. This implies that the branch we are going to create will not have this cycle again.

– Recompute $(\mathcal{G}, s)$ from $(\mathcal{G}_{cur}, s_{cur})$ such that the new parent set of $X_{a_{y+1}}$ in $\mathcal{G}$ complies with this new $\mathcal{H}_{cur}$. This is done by searching in the cache $C(X_{a_{y+1}}, \Pi_{a_{y+1}})$ for the best parent set. If there is a parent set in the cache that satisfies $\mathcal{H}_{cur}$, then include the triple $(\mathcal{G}, \mathcal{H}_{cur}, s)$ into $Q$.

– Mark on $\mathcal{H}_{cur}$ that the arc $X_{a_y} \to X_{a_{y+1}}$ must be present and that the sibling arc $X_{a_{y+1}} \to X_{a_y}$ is prohibited, and continue the loop of step 4. (This last step forces the branches that we are creating to be disjoint among each other.)
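The loop above can be sketched compactly as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: the cache is a dictionary mapping each node to a list of (score, parent set) pairs sorted by decreasing score, the marks are stored as two sets of arcs instead of a matrix, every node's parent set is re-selected from the cache for each subcase (rather than only that of $X_{a_{y+1}}$), and no initial solution is supplied. All function names are hypothetical.

```python
import heapq
from itertools import count


def find_cycle(parents):
    """Return the arcs (u, v) of one directed cycle, or None if acyclic."""
    remaining, changed = set(parents), True
    while changed:                       # peel off nodes with no parent left
        changed = False
        for v in list(remaining):
            if not (parents[v] & remaining):
                remaining.discard(v)
                changed = True
    if not remaining:
        return None
    v, seen = next(iter(remaining)), []
    while v not in seen:                 # walk backwards until a node repeats
        seen.append(v)
        v = next(iter(parents[v] & remaining))
    seg = seen[seen.index(v):]           # seg[i + 1] is a parent of seg[i]
    return [(seg[(i + 1) % len(seg)], seg[i]) for i in range(len(seg))]


def relax(cache, required, prohibited):
    """Best (possibly cyclic) graph that respects the arc marks, or None."""
    graph, total = {}, 0.0
    for node, entries in cache.items():
        need = {u for (u, v) in required if v == node}
        ban = {u for (u, v) in prohibited if v == node}
        hit = next(((s, ps) for s, ps in entries
                    if need <= ps and not (ps & ban)), None)
        if hit is None:
            return None
        total += hit[0]
        graph[node] = hit[1]
    return total, graph


def branch_and_bound(cache):
    tie = count()                        # tie-breaker so graphs are never compared
    best_score, best_graph = float("-inf"), None
    s, g = relax(cache, frozenset(), frozenset())
    queue = [(-s, next(tie), g, frozenset(), frozenset())]
    while queue:
        neg, _, graph, req, pro = heapq.heappop(queue)
        if -neg <= best_score:           # bound step
            continue
        cycle = find_cycle(graph)
        if cycle is None:                # feasible DAG: new incumbent
            best_score, best_graph = -neg, graph
            continue
        forced, banned = set(req), set(pro)
        for (u, v) in cycle:             # branch step: one subcase per arc
            sub_pro = frozenset(banned | {(u, v)})
            sub = relax(cache, frozenset(forced), sub_pro)
            if sub is not None:
                heapq.heappush(
                    queue, (-sub[0], next(tie), sub[1], frozenset(forced), sub_pro))
            forced.add((u, v))           # later subcases must contain this arc
            banned.add((v, u))           # and cannot contain its sibling
    return best_score, best_graph
```

Because the queue always returns the relaxation with the greatest score, the first DAG that is popped is already optimal in this sketch; the remaining entries are then removed by the bound step.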

There are two considerations to show the correctness of the method. First, we need to guarantee that all the search space is considered, even though we do not search through all of it. Second, we must ensure that the same part of the search space is not processed more than once, so we do not lose time and know that the algorithm will finish with a best global graph. The search is conducted over all possible graphs (not necessarily DAGs). The queue $Q$ contains the subspaces (of all possible graphs) to be analyzed. A triple $(\mathcal{G}, \mathcal{H}, s)$ indicates, through $\mathcal{H}$, which subspace this is. $\mathcal{H}$ is a matrix containing an indicator for each possible arc. It says if an arc is allowed (meaning it might or might not be present), prohibited (it cannot be present), or demanded (it must be present) in the current subspace of graphs. Thus, $\mathcal{H}$ completely defines the subspaces. $\mathcal{G}$ and $s$ are respectively the best graph inside $\mathcal{H}$ (note that $\mathcal{G}$ might have cycles) and its score value (which is an upper bound for the best DAG in this subspace).

In the initialization step, $Q$ begins with a triple where $\mathcal{H}$ indicates that every arc is allowed (see footnote 5), so all possible graphs are within the subspace of the initial $\mathcal{H}$. The score $s$ of $\mathcal{G}$ is compared against the best known score. Note that, as $\mathcal{G}$ is the graph with the greatest score that respects $\mathcal{H}$, no other graph within the subspace defined by $\mathcal{H}$ can have a better score. Therefore, if $s$ is less than the best known score, the whole branch represented by $\mathcal{H}$ may be discarded (this is the bound step). Certainly no graph in that subspace is worth exploring, because their scores are at most $s$.

If $\mathcal{G}$ has score greater than $s_{best}$, then the graph $\mathcal{G}$ is checked for cycles, as it may or may not be acyclic (all we know is that $\mathcal{G}$ is a relaxed solution within the subspace $\mathcal{H}$). If it is acyclic, then $\mathcal{G}$ is the best known graph and the best score is updated. If $\mathcal{G}$ is cyclic, then we need to divide the space $\mathcal{H}$ into smaller subcases with the aim of removing the cycles of $\mathcal{G}$ (this is the branch step). Two characteristics must be kept by the branch step: (i) $\mathcal{H}$ must be fully represented in the subcases (so we do not miss any graph), and (ii) the subcases must be disjoint (so we do not process the same graph more than once). A possible way to achieve these two requirements is as follows: let the cycle $v = (X_{a_1} \to X_{a_2} \to \ldots \to X_{a_{q+1}})$ be the one detected in $\mathcal{G}$. We create $q$ subcases such that

• The first subcase does not contain $X_{a_1} \to X_{a_2}$ (but may contain the other arcs of that cycle, that is, we do not prohibit the others).

• The second subcase certainly contains $X_{a_1} \to X_{a_2}$, but $X_{a_2} \to X_{a_3}$ is prohibited (so they are disjoint because of the difference in the presence of the first arc).

• (And so on such that) The $y$-th subcase certainly contains $X_{a_{y'}} \to X_{a_{y'+1}}$ for all $y' < y$ and prohibits $X_{a_y} \to X_{a_{y+1}}$. This is done until the last element of the cycle.

This is the same idea as the inclusion-exclusion principle, but applied here to the arcs of the cycle. It ensures that we never process the same graph twice, and also that we cover all the graphs, as by the union of the mentioned sets we obtain the original $\mathcal{H}$. Because of that, the algorithm runs at most $\prod_i |C(X_i)|$ steps, where $|C(X_i)|$ is the size of the cache for $X_i$ (there are not more ways to combine parent sets than that number). In practice, we expect the bound step to be effective in dropping parts of the search space in order to reduce the total time cost.

5. In fact, the implementation may be smarter and set $\mathcal{H}$ with possible known restrictions of arcs, that is, those that are known to be demanded or prohibited by structural constraints may be included in the initial $\mathcal{H}$.

B&B can be stopped at any time, and the current best solution as well as an upper bound for the global best score are available. The stopping criterion might be based on the number of steps, time and/or memory consumption, or percentage of error (difference between the upper and lower bounds). This is an important property of this method. For example, if we are just looking for an improving solution, we may include in the loop a test of whether the current best solution is already better than some threshold, which would save computational time. Still, if we run it until the end, we are guaranteed to obtain a globally optimal solution.

The algorithm can also be easily parallelized. We can split the content of the priority queue into many different tasks. No shared memory needs to exist among tasks if each one has its own version of the cache. The only data structure that needs consideration is the queue, which from time to time must be balanced between tasks. With a message-passing scheme that avoids using process locks, the gain of parallelization is linear in the number of tasks. If run until it ends, the proposed method gives a globally optimal solution for the structure learning problem.

Some particular cases of the algorithm are worth mentioning. If we fix an ordering for the variables such that every arc must go from a node to one that does not precede it in the ordering (this is a common idea in many approximate methods), the proposed algorithm does not perform any branching, as the ordering implies acyclicity, and so the initial solution is already the best (for that ordering; the number of possible orderings is exponential in $n$). The performance would be proportional to the time to create the cache. Another important case is when one limits the maximum number of parents of a node. This is relevant for hard problems with many variables, as it would imply a bound on the cache size.

6.  Experiments 

We  perform  experiments  to  show  the  benefits  of  the  reduced  cache  and  search  space.  Later 
we  show  some  examples  of  the  use  of  constraints.6  First,  we  use  data  sets  available  at  the 
UCI  repository  (Asuncion  and  Newman,  2007).  Lines  with  missing  data  are  removed  and 
continuous  variables  are  discretized  over  the  mean  into  binary  variables.  The  data  sets  are: 
adult (15 variables and 30162 instances), breast (10 variables and 683 instances), car (7 variables and 1728 instances), letter (17 variables and 20000 instances), lung (57 variables and 27 instances), mushroom (23 variables and 1868 instances, denoted by mush), nursery (9 variables and 12960 instances, denoted by nurse), Wisconsin Diagnostic Breast Cancer (31 variables and 569 instances, denoted by wdbc), zoo (17 variables and 101 instances). The number of categories per variable varies from 2 to dozens in some cases (we refer to UCI for further details).

6. The software is available online at the web address http://www.ecse.rpi.edu/~cvrl/structiearning.html

                   ESS         adult    breast   car      letter   lung     mush     nurse    wdbc     zoo
Memory (in MB)     0.1         6.2      0.0      0.1      3.7      1699.6   7.5      0.9      221.2    0.4
                   1           6.2      0.0      0.1      3.7      1150.1   5.9      0.8      204.6    0.4
                   10          6.3      0.0      0.1      3.8      812.3    5.4      0.7      206.2    0.3
                   BIC         1.8      0.0      0.0      2.3      0.3      0.5      0.4      5.3      0.1
Time (in sec.)     0.1         89.3     0.0      0.0      429.4    2056     357.9    0.7      2891     1.7
                   1           91.6     0.0      0.0      440.4    1398     278.7    0.7      2692     1.7
                   10          91.6     0.0      0.0      438.1    1098     268.9    0.7      2763     1.7
                   BIC         67.4     0.0      0.1      859.6    1.3      72.1     1.4      351      0.3
Number of Steps    0.1         2^17.4   2^10.5   2^8.8    2^20.1   2^30.8   2^24.?   2^11.2   2^27.9   2^19.8
                   1           2^17.4   2^10.5   2^8.8    2^20.1   2^30.2   2^23.6   2^11.2   2^27.8   2^19.7
                   10          2^17.4   2^10.4   2^8.8    2^20.1   2^29.8   2^23.5   2^11.2   2^27.9   2^19.6
                   BIC         2^14.8   2^7.3    2^8.4    2^19.0   2^15.4   2^17.1   2^10.9   2^20.7   2^13.1
                   Worst-case  2^17.9   2^12.3   2^8.8    2^20.1   2^31.1   2^26.5   2^11.2   2^28.4   2^20.1

Table 1: Memory, time and number of steps (local score evaluations) used to build the cache. Results for BIC and BDeu score with ESS varying from 0.1 to 10 are presented.

Table 1 presents the memory used in MB (first block), the time in seconds (second block) and the number of steps in local score evaluations (third block) for the cache construction, using the properties of Section 4. Each column presents the results for a distinct data set. The rows show results for BDeu with ESS equal to 0.1, 1 and 10, and for BIC. The worst-case row presents the number of steps to build the cache without using Theorems 4 (for BIC/AIC) and 9 (for BDeu), which are the theorems that allow the algorithm to avoid computing every subset of parents. As we see through the log-scale in which they are presented, the reduction in the number of steps has not been exponential, but it still saves a good amount of computation (roughly half of the work). In the case of the BIC score, the reduction is more significant. In terms of memory, the usage clearly increases with the number of variables in the network (lung has 57 and wdbc has 31 variables).

These results translate into a performance gain for many algorithms in the literature
that learn Bayesian network structures, as long as they only need
to work over the (already precomputed) small cache. Table 2 presents the final cache
characteristics, where we find the most attractive results, for instance, the small cache sizes
when compared to the worst case. The first block contains the maximum number of parents
per node (averaged over the nodes, with the actual maximum in parentheses). The
worst case is the total number of nodes in the data set minus one, except for lung (where
we have set a limit of at most six parents) and wdbc (with at most eight parents). The
second block shows the cache size for each data set and distinct values of ESS. We also
show the results for the BIC score and the worst-case values for comparison. We see that

Block                     ESS          adult     breast    car           letter    lung       mush      nurse         wdbc          zoo
Max. number of parents    0.1          2.1(4)    1.0(1)    0.7(1)        4.5(5)    0.1(2)     4.1(5)    1.2(3)        1.3(2)        1.4(3)
                          1            2.4(4)    1.0(1)    1.0(2)        5.2(6)    0.4(2)     4.4(7)    1.7(3)        1.7(3)        1.9(4)
                          10           3.3(5)    1.0(1)    1.9(2)        5.9(6)    3.0(4)     4.8(8)    2.1(3)        3.1(4)        3.4(4)
                          BIC          2.8(5)    1.0(1)    1.3(2)        6.3(7)    2.1(3)     4.1(4)    1.8(3)        2.7(3)        2.8(3)
                          Worst-case   14.0      9.0       6.0           16.0      6.0*       22.0      8.0           8.0*          16.0

Final size of the cache   0.1          2^4.2     2^1.5     2^1.1         2^8.2     2^0.2      2^8.5     2^1.9         2^3.6         2^3.3
                          1            2^4.8     2^1.9     2^1.6         2^9.0     2^0.8      2^8.9     2^2.4         2^4.9         2^4.4
                          10           2^6.3     2^3.3     2^3.0         2^10.5    2^10.7     2^9.8     2^3.5         2^12.1        2^8.9
                          BIC          2^9.3     2^4.7     2^4.5         2^15.3    2^11.5     2^13.0    2^5.6         2^12.9        2^10.9
                          Worst-case   2^17.9    2^12.3    2^8.8         2^20.1    2^31.1*    2^26.5    2^11.2        2^28.4*       2^20.1

Implied search space      0.1          2^54.1    2^13.3    [illegible]   2^129.0   2^8.2      2^175.7   [illegible]   [illegible]   2^39.3
(approx.)                 1            2^62.1    2^17.1    2^8.3         2^144.8   2^33.1     2^186.0   2^15.4        2^132.7       2^60.3
                          10           2^91.6    2^33.2    2^20.6        2^176.1   2^612.0    2^221.8   2^27.3        2^375.1       2^150.7
                          BIC          2^71      2^23      2^10          2^188     2^330      2^180     2^17          2^216         2^111
                          Worst-case   2^210     2^90      2^42          2^272     2^1441*    2^506     2^72          2^727*        2^272

Table 2: Final cache characteristics: maximum number of parents (averaged over the nodes,
with the actual maximum in parentheses), actual cache size,
and (approximate) search space implied by the cache. Worst cases are presented
for comparison (those marked with a star are computed using the constraint on
the number of parents that was applied to lung and wdbc). Results for BIC and
for BDeu with ESS from 0.1 to 10 are presented.


the actual cache size is smaller (by orders of magnitude) than in the worst case. It is
also possible to analyze the reduction that these results imply for the search space of
structure learning. We must point out that by search
space we mean all the possible combinations of parent sets for all the nodes.
Some of these combinations are not DAGs, but they are still counted. There are
two reasons for this: (i) counting DAGs exactly (in order to give the
exact search space size) is a harder problem, and (ii) many structure learning algorithms run over more than
just DAGs, because they need to look at the graphs (and thus combinations of parents) to
decide whether they are acyclic. In these cases, the actual search space is not simply the set
of possible DAGs, even though the final solution will be a DAG. Still, some algorithms might
do better by searching for the best structure in some other way than looking
over possible DAGs, which might imply a smaller worst-case complexity (for instance, the
dynamic programming method runs over subsets of variables, of which there are 2^n).
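The counting convention described above (all combinations of parent sets, cyclic ones included) can be illustrated with a short sketch. The function names and the input format are assumptions made for this illustration only; they are not taken from the implementation used in the experiments.

```python
import math

def log2_search_space(cache_sizes):
    """log2 of the number of parent-set combinations implied by a cache.

    cache_sizes[i] is the number of candidate parent sets kept for node i
    (hypothetical input). Cyclic combinations are counted too, as in the text.
    """
    return sum(math.log2(s) for s in cache_sizes)

def log2_worst_case(n, max_parents=None):
    """log2 of the count when each of n nodes may take any subset of the
    other n - 1 nodes, optionally restricted to at most max_parents parents."""
    k = n - 1 if max_parents is None else min(max_parents, n - 1)
    per_node = sum(math.comb(n - 1, j) for j in range(k + 1))
    return n * math.log2(per_node)

# Seven variables without a parent limit: 7 * log2(2^6) = 42.
print(log2_worst_case(7))
# A small made-up cache with 2, 4 and 8 candidate sets for three nodes.
print(log2_search_space([2, 4, 8]))
```

For seven variables the worst case is 2^42, which matches the worst-case entry for car in Table 2, while dynamic programming over subsets would visit only 2^7 subsets of variables.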

An expected but important point to emphasize is the correlation of the prior with the
time and memory needed to build the cache. One would expect that the larger the ESS (and thus
the more the prior pulls towards the uniform), the slower and more memory-consuming the method. That
is because a stronger prior smooths the scores of the different parent sets and makes it harder to see
large differences among them, and consequently the properties that would reduce the cache size
are less effective. However, this is not evident from the results, where the relation
between ESS and time/memory is not clear. Yet it must be noted that the two largest
data sets in terms of number of variables (lung and wdbc) could not be processed
without setting other limits, such as a maximum number of parents or a maximum number

Score             network    B&B score    B&B gap   B&B time   DP score   DP time   OS score   OS time   HC score   HC time
BIC               adult      -286902.8    5.5%      150.3      0.0%       0.77      0.1%       0.17      0.5%       0.30
                  breast     -8254.8      0.0%      0.01       0.0%       0.01      0.0%       0.01      0.0%       0.00
                  car        -13100.5     0.0%      0.01       0.0%       0.01      0.0%       0.01      0.2%       0.00
                  letter     -173716.2    8.1%      574.1      -0.6%      22.8      1.0%       0.75      3.7%       0.30
                  lung       -1146.9      2.5%      907.1      Fail       Fail      1.0%       0.13      0.7%       0.05
                  mushroom   -12834.9     15.3%     239.8      Fail       Fail      1.0%       0.12      4.8%       0.05
                  nursery    -126283.2    0.0%      0.04       0.0%       0.04      0.0%       0.04      0.03%      0.06
                  wdbc       -3053.1      13.6%     333.5      Fail       Fail      0.8%       0.13      0.9%       0.02
                  zoo        -773.4       0.0%      5.2        0.0%       3.5       1.0%       0.03      0.6%       0.00

BDeu, ESS = 0.1   adult      -288591.2    0.0%      92.1       0.0%       0.75      0.1%       0.21      0.3%       0.32
                  breast     -8635.1      0.0%      0.02       0.0%       0.01      0.0%       0.01      0.0%       0.00
                  car        -13295.0     0.0%      0.01       0.0%       0.00      0.0%       0.00      0.1%       0.01
                  letter     -181941.5    5.7%      375.75     -0.1%      7.6       0.1%       0.27      2.1%       0.27
                  lung       -1731.9      0.0%      0.22       Fail       Fail      0.0%       0.11      0.0%       0.05
                  mushroom   -12564.2     14.7%     382.4      Fail       Fail      0.2%       0.15      5.3%       0.05
                  nursery    -126660.4    0.0%      0.06       0.0%       0.04      0.0%       0.04      0.1%       0.06
                  wdbc       -3558.6      4.4%      494.1      Fail       Fail      1.4%       0.05      1.3%       0.01
                  zoo        -1024.5      0.0%      0.09       0.0%       3.1       0.8%       0.01      1.0%       0.00

BDeu, ESS = 1     adult      -286695.2    4.5%      203.0      0.0%       0.76      0.1%       0.22      0.3%       0.34
                  breast     -8254.3      0.0%      0.02       0.0%       0.01      0.0%       0.01      0.0%       0.00
                  car        -13145.3     0.0%      0.01       0.0%       0.00      0.0%       0.00      0.05%      0.00
                  letter     -178635.2    6.7%      520.2      -0.7%      9.9       0.0%       0.34      2.1%       0.27
                  lung       -1249.7      0.0%      0.61       Fail       Fail      0.1%       0.12      0.1%       0.05
                  mushroom   -12097.0     16.7%     381.5      Fail       Fail      0.2%       0.19      4.2%       0.05
                  nursery    -126212.7    0.0%      0.06       0.0%       0.04      0.0%       0.04      0.1%       0.05
                  wdbc       -3175.9      11.2%     471.1      Fail       Fail      0.7%       0.06      1.0%       0.02
                  zoo        -794.1       0.0%      1.4        0.0%       3.4       1.1%       0.02      8.7%       0.00

BDeu, ESS = 10    adult      -285014.5    11.8%     213.8      -0.1%      0.88      0.04%      0.24      0.5%       0.33
                  breast     -8130.2      0.0%      0.04       0.0%       0.01      0.0%       0.00      0.3%       0.00
                  car        -13038.6     0.0%      0.03       0.0%       0.00      0.0%       0.00      0.03%      0.00
                  letter     -174111.8    8.7%      1250       -0.4%      22.3      0.1%       0.84      1.8%       0.32
                  lung       -957.2       11.7%     2118       Fail       Fail      3.3%       1.38      2.3%       0.1
                  mushroom   -11924.0     22.7%     587.8      Fail       Fail      0.1%       0.43      2.4%       0.07
                  nursery    -125846.5    0.0%      0.14       0.0%       0.04      0.0%       0.04      0.1%       0.06
                  wdbc       -2986.2      22.2%     1938       Fail       Fail      0.6%       2.8       1.4%       0.23
                  zoo        -697.2       13.2%     367.7      -0.3%      5.0       1.4%       0.1       0.9%       0.00

Table 3: Comparison of scores among B&B, DP, OS and HC. Fail means that the method could not
solve the problem, either within 10 million steps or because of the memory limit (4GB). DP,
OS and HC scores are given as percentages w.r.t. the score of B&B (positive means
worse than B&B and negative means better). An entry of 0.0% means that
the result, in that instance, was exactly equal to the B&B result (in terms of the
score). Times are given in seconds.


of free parameters in the node (we have not used any limit for the other data sets). We used
an upper limit of six parents per node for lung and eight for wdbc. This situation deserves
further study to clarify whether it is possible to run these computations on large data
sets with large ESS. It might be necessary to find tighter bounds, if at all possible, that
is, stronger results than Theorem 9 that discard unnecessary score evaluations earlier in the
computations. Nevertheless, the main goal of the present work is not to study the impact
of ESS on learning, but to present properties that improve the performance of learning
methods.

In Table 3, we show results of four distinct algorithms: the B&B described in Section 5,
the dynamic programming (DP) idea of Silander and Myllymäki (2006), the hill-climbing
(HC) method starting with an empty structure, and an algorithm that picks variable
orderings at random and then finds the best structure such that every arc goes from a node
to another node that does not precede it in the ordering. This algorithm (named OS) is similar to
the K2 algorithm with random orderings, but it is never worse because a global optimum is
found for each ordering. Note that OS performs better than HC in almost all test cases.
We have chosen to analyze BIC (given that the properties provided a greater
reduction in the search space in this case) and BDeu with ESS equal to 0.1, 1, and 10. It
is clear from the results for ESS equal to 10 that the B&B procedure struggles with very
large search spaces, and the same might happen for even larger ESS.
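The ordering-based search (OS) described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: the cache format (a list of score and parent-set pairs per node), the function names and the toy scores are all hypothetical.

```python
import random

def best_for_ordering(order, cache):
    """Best structure whose arcs all go forward in `order`.

    cache[v] is a list of (local_score, parent_set) pairs with parent_set a
    frozenset; the empty set is assumed to be cached for every node. Because
    the score is decomposable, each node independently takes its best cached
    parent set among those that use only preceding nodes.
    """
    allowed, total, parents = set(), 0.0, {}
    for v in order:
        feasible = [(s, p) for s, p in cache[v] if p <= allowed]
        score, chosen = max(feasible, key=lambda entry: entry[0])
        total += score
        parents[v] = chosen
        allowed.add(v)
    return total, parents

def ordering_search(cache, samples, seed=0):
    """Sample random orderings and keep the best structure found."""
    rng = random.Random(seed)
    nodes = list(cache)
    best_total, best_parents = float("-inf"), None
    for _ in range(samples):
        rng.shuffle(nodes)
        total, parents = best_for_ordering(nodes, cache)
        if total > best_total:
            best_total, best_parents = total, parents
    return best_total, best_parents

# Toy cache with made-up local scores (higher is better).
cache = {
    "A": [(-10.0, frozenset()), (-7.0, frozenset({"B"}))],
    "B": [(-9.0, frozenset()), (-6.5, frozenset({"A"}))],
    "C": [(-8.0, frozenset()), (-5.0, frozenset({"A", "B"}))],
}
print(ordering_search(cache, samples=200))
```

Since each ordering is solved with a single pass over the cached parent sets, a small cache makes every sample cheap, which is what allows many orderings to be tried.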

The scores obtained by each algorithm (as a percentage relative to the value obtained by
B&B) and the corresponding times are shown in Table 3 (excluding the cache construction).
A limit of ten million steps is given to each method (a step here is a query
to the cache). We also present the reduced space where B&B performs its
search, as well as the maximum gap of the solution. This gap is obtained from the relaxed
version of the problem. We can guarantee that the globally optimal solution is within this
gap (even though the solution found by B&B may already be the best, as happens,
for example, in the first row of the table). With the reduced cache presented here, finding
the best structure for a given ordering is very fast, so it is possible to run OS over millions
of orderings in a short period of time. Some additional comments are in order. DP could
not solve wdbc or lung even without the limit on the number of steps, because it exhausted
16GB of memory. Hence, we cannot expect it to obtain answers in larger cases. It
is clear that (in a worst-case sense) the number of steps of DP is smaller than that of B&B,
and this behavior can be seen in data sets with a small number of variables. Nevertheless,
B&B bounds some regions without processing them, provides an upper bound
at each iteration, and does not suffer from memory exhaustion as DP does. It is true that B&B
also uses an increasing amount of memory if the bounds are not good, but this case can be tackled
by (automatically) switching between B&B and a depth-first search (see footnote 7). This makes the
method applicable even to very large settings. When n is large (more than 35), DP will
not finish in reasonable time, and hence will not provide any solution, while B&B still
gives an approximation and a bound on the global optimum. As for OS, if we sample even
more orderings, its results improve and the global optimum is found also for the adult and
mushroom sets. Still, OS provides no guarantee or estimate of how far the global
optimum is (here we know that the optimum has been achieved because of the solutions of
the exact methods). It is worth noting that both DP and OS also benefit from the


7. Our implementation is able to switch between breadth-first and depth-first searches, but this behavior
was not used in the experiments of this paper.

smaller cache. Although we discuss only four algorithms, a performance gain from
applying the properties to other algorithms is expected as well.
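To make the notion of a gap concrete, the sketch below computes a bound from a relaxation that drops the acyclicity requirement, so that every node simply takes its best cached parent set. Both the cache format and the exact gap formula (difference relative to the absolute value of the incumbent score) are assumptions for illustration; the text only states that the gap comes from a relaxed version of the problem.

```python
def relaxed_upper_bound(cache):
    """Upper bound on the best DAG score: each node takes its best cached
    local score, and cycles are allowed (the relaxation)."""
    return sum(max(score for score, _ in entries) for entries in cache.values())

def relative_gap(incumbent_score, upper_bound):
    """Assumed gap definition: distance from the incumbent DAG score to the
    bound, relative to the magnitude of the incumbent score."""
    return (upper_bound - incumbent_score) / abs(incumbent_score)

# Toy cache with made-up local scores; A and B each prefer the other as a
# parent, so the relaxed solution contains a cycle and is not a DAG.
cache = {
    "A": [(-10.0, frozenset()), (-7.0, frozenset({"B"}))],
    "B": [(-9.0, frozenset()), (-6.5, frozenset({"A"}))],
    "C": [(-8.0, frozenset()), (-5.0, frozenset({"A", "B"}))],
}
bound = relaxed_upper_bound(cache)   # -7.0 + -6.5 + -5.0
incumbent = -21.0                    # score of the DAG B -> A, {A, B} -> C
print(bound, relative_gap(incumbent, bound))
```

In this toy case the incumbent is in fact the best DAG, yet the gap is not zero, which mirrors the remark that the solution found may already be optimal even when a positive gap is reported.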


network     time (s)   cache size   space
adult       0.26       114          [illegible]
car         0.01       14           2^6.2
letter      0.32       233          2^61
lung        0.26       136          2^51
mushroom    0.71       398          2^88
nursery     0.06       26           2^12
wdbc        361.64     361          [illegible]
zoo         8.4        1697         2^111

Table  4:  B&B  procedure  learning  TANs  using  BIC.  Time  (in  seconds)  to  find  the  global 
optimum,  cache  size  (number  of  stored  scores)  and  (reduced)  space  for  the  B&B 
search. 


The last part of this section is dedicated to some test cases with constraints. Table
4 shows the results when we employ constraints to force the final network to be a
Tree-augmented Naive Bayes (TAN). Here the class variable is isolated in the data set and constraints
are included as described in Section 3. Note that the cache size, the search space and
consequently the time to solve the problems have decreased substantially. Finally, Table 5 has
results for random data sets with predefined numbers of nodes and instances using the BIC
score. A randomly created Bayesian network with at most 3n arcs (where n is the number
of nodes) is used to sample the data. Because of that, we are able to generate random
structural constraints that are certainly valid for this true Bayesian network (approximately n
constraints for each case). The table contains the total time to run the problem and the
size of the cache, together with the results when using constraints. Note that the code was
run in parallel with a number of tasks equal to n; otherwise, an increase by a factor of
n must be applied to the results in the table. Each row contains the mean and standard
deviation of ten executions (using randomly generated networks) for time and cache size with
and without constraints (using the same data sets in order to compare them). We can see
that the gain is consistent across all cases. The B&B method was able to find a globally optimal
solution in all but some of the cases with one hundred nodes, where it achieved an approximate
solution with error always below 0.1% (this happened in 40% of the test cases with 100
nodes). We point out that the other exact method we have analyzed, based on dynamic
programming, cannot deal with such large networks because of both memory and time costs.
Note that we are not considering the improvement in accuracy when using constraints, but
just the computational gain. It is not trivial to measure the quality of a learned structure,
because the target of the methods is the underlying probability distribution, and distinct
structures may fit such a distribution well. For instance, comparing the
number of matching arcs is meaningful only if one is interested in the structure itself,
and not in the fit to the underlying distribution. This topic deserves attention, but it
would take us far from the focus of this study.
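One way to set up such a random experiment is sketched below. It is only an illustration under assumptions: the text does not say how the random networks or the constraints were drawn, so the permutation-based DAG generator and the two constraint kinds used here ("require" an arc of the true network, "forbid" an arc absent from it) are hypothetical choices.

```python
import random

def random_dag(n, max_arcs, rng):
    """Random DAG on nodes 0..n-1 with at most max_arcs arcs.

    Arcs always go from an earlier to a later node of a random permutation,
    so the result is acyclic by construction.
    """
    order = list(range(n))
    rng.shuffle(order)
    pairs = [(order[i], order[j]) for i in range(n) for j in range(i + 1, n)]
    k = rng.randint(0, min(max_arcs, len(pairs)))
    return set(rng.sample(pairs, k))

def valid_constraints(arcs, n, count, rng):
    """Sample constraints that the generating DAG certainly satisfies:
    an arc of the DAG may be required, a missing arc may be forbidden."""
    constraints = []
    for _ in range(count):
        u, v = rng.sample(range(n), 2)
        kind = "require" if (u, v) in arcs else "forbid"
        constraints.append((kind, u, v))
    return constraints

rng = random.Random(1)
n = 30
true_arcs = random_dag(n, 3 * n, rng)            # at most 3n arcs
constraints = valid_constraints(true_arcs, n, n, rng)  # about n constraints
print(len(true_arcs), constraints[:3])
```

Because every constraint is read off the generating network, the true structure always remains feasible for the constrained search.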




                 unconstrained                               constrained
nodes (n)/       time (sec)           cache size             time (sec)           cache size
instances        mean      std.dev.   mean      std.dev.     mean      std.dev.   mean      std.dev.
30/100           0.07      0.02       49.6      9.1          0.04      0.01       44.3      8.98
30/500           3.70      1.18       75.6      16.6         2.33      0.73       61.4      17.7
50/100           0.31      0.08       77.9      9.6          0.20      0.04       66.1      6.71
50/500           37.1      10.8       102.5     23.0         23.2      6.86       83.0      17.7
70/100           1.91      0.82       127.5     18.1         0.97      0.32       108.3     13.6
70/500           293.3     99.5       137.3     22.2         176.3     62.6       111.8     14.5
100/100          85.0      29.3       253.4     27.7         4.44      1.06       199.5     21.1
100/500          2205.6    534.4      204.6     32.1         1414.8    419.2      168.0     21.3

Table  5:  Results  on  ten  data  sets  per  line  generated  from  random  networks.  Both  mean  and 
standard  deviation  of  time  to  solve  (with  an  upper  limit  of  20  million  steps)  and 
size  of  the  cache  (in  number  of  scores)  are  presented  for  the  normal  unconstrained 
case  and  for  the  constrained  cases  (over  the  same  data  sets). 


7.  Conclusions 

This paper describes novel properties of decomposable score functions for learning Bayesian
network structures from data. Such properties allow the construction of a cache with all possible
local scores of nodes and their parents without large memory consumption, which can later
be used by search algorithms. Memory consumption has been a bottleneck for
some algorithms in the literature; see, for example, Parviainen and Koivisto (2009). The
properties also imply a considerable reduction of the search space of graphs without losing the globally
optimal structure, that is, the overall best graph is guaranteed to remain in the reduced
space. In fact, the reduced memory and search space potentially benefit many structure
learning methods in the literature, and we have conducted experiments with some of them.

An algorithm based on a branch-and-bound technique is described, which integrates
structural constraints with data. The procedure guarantees global optimality with respect
to the score function. It is an anytime procedure in the sense that, if stopped early, it provides
the best solution found so far and a maximum error for that solution. This is especially
important if one wants to integrate it with an expectation-maximization (EM) method to
treat incomplete data sets, and such a characteristic is usually not present in other exact
structure learning methods. Inside the EM method, the global structure learning procedure
ensures that the maximization step is never trapped in a local solution, and the anytime
property allows the use of a generalized EM idea to reduce the computational
cost considerably.

Because of the properties and the characteristics of the B&B method, it is more efficient
than state-of-the-art exact methods based on dynamic programming for large domains. We show
through experiments with randomly generated data and public data sets that problems with
up to 70 nodes can be processed exactly in reasonable time, and problems with 100 nodes are
handled within a small worst-case error. Dynamic programming methods are able to treat
fewer than 35 variables. The ideas described here may also help to improve other approximate methods
and may have interesting practical applications. We show through experiments with public
data sets that memory requirements are small, as is the resulting reduced search
space. Of course, we do not expect to solve problems exactly for considerably larger networks;
still, the paper makes a relevant step towards solving larger instances. We can summarize the
comparison with the dynamic programming idea as follows: if the problem has few variables,
dynamic programming is probably the fastest method (the branch-and-bound method will
also be reasonably fast); if the problem has medium size, the branch-and-bound method
might solve it exactly (dynamic programming will mostly fail to answer); finally, if the
problem is large, the branch-and-bound method will still give an approximation (and
its worst-case error), while the standard dynamic programming idea will fail.

There is certainly much more to be done. One important question is whether the
bounds of the theorems in Section 4 (more specifically Theorem 9) can be improved.
We are actively working on this question. Furthermore, the experimental analysis
can be extended to further clarify the understanding of the problem, for instance how the
ESS affects the results. It is clear that, for considerably large domains, none of the exact
methods will suffice by themselves. Besides developing ideas and algorithms for
dealing with large domains, the comparison of structures, and what makes a structure good, is
an important topic. For example, the accuracy of the generated networks can be evaluated with
real data. However, that does not ensure that we are finding the true links of the
underlying structure, only a somewhat similar graph that produces a close joint distribution.
To assess that, one could use generated data and compare the learned structures against the one
from which the data were generated. A study of how the properties may help fast approximate methods
is also a desired goal.

Simultaneous Influence Diagrams

Abstract: Evaluating an influence diagram (ID) is a challenging
problem because its complexity increases exponentially in
the number of decision nodes in the diagram. In this paper, we
examine the problem for a special class of IDs where multiple
decisions must be made simultaneously. We describe a brief theory
that factors out the computations common to all policies in
evaluating them. Our evaluation approach conducts these
computations once and uses them across all policies. We identify the
ID structures for which the approach can achieve savings. We
show that the approach can be used to efficiently recompute the
optimal policy of an ID when its structure or parameters change.
Finally, we demonstrate the superior performance of the approach
by simulation studies and a military planning example.

Index Terms: Algorithm, decision making under uncertainty,
graphical model, influence diagram (ID), military analysis.

I.  Introduction 

An influence diagram (ID) is a graphical
model for decision making under uncertainty [1]. An ID
comprises decision nodes, random nodes, value nodes, and
the probabilistic relations among these nodes. An ID is a more
compact representation of a decision tree, which is a simple tool
for decision analysis [2].

Given an ID, a policy prescribes an action choice for each
decision node. Evaluating a policy means computing the expected
value of the ID under the policy. Evaluating an ID means finding
the optimal policy, that is, the one that maximizes the expected value of the
ID. A generic approach to evaluating an ID has to enumerate
all policies, compare the expected values under them, and
choose the optimal one. However, the number of policies grows
exponentially with the number of decision nodes. This renders
the approaches for general ID evaluation very inefficient and
infeasible for large IDs. Consequently, it is advisable to study
efficient algorithms for special IDs.
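The generic enumeration approach can be sketched in a few lines. The action names and the stand-in value function below are hypothetical and do not come from any real ID; in practice `expected_value` would require a full inference over the diagram for every policy, which is what makes the approach expensive.

```python
from itertools import product

def evaluate_by_enumeration(action_sets, expected_value):
    """Return (best_policy, best_value) by trying every policy.

    action_sets[i] lists the actions of decision node i; expected_value is a
    caller-supplied function that scores one complete policy.
    """
    best_policy, best_value = None, float("-inf")
    for policy in product(*action_sets):
        value = expected_value(policy)
        if value > best_value:
            best_policy, best_value = policy, value
    return best_policy, best_value

def toy_value(policy):
    """Stand-in for an expected-value computation (not a real ID)."""
    return sum(len(action) for action in policy)

# Three binary decision nodes give 2 * 2 * 2 = 8 policies.
actions = [("attack", "hold"), ("north", "south"), ("day", "night")]
print(evaluate_by_enumeration(actions, toy_value))
```

With k binary decision nodes the loop runs 2^k times, which illustrates the exponential growth mentioned above.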

Most of the previous approaches assume that there exists a
linear ordering among the decision nodes. This ordering implies
that the choice made at a decision node is known to the decision
maker when he/she chooses the actions for the subsequent
decision nodes (e.g., see [13]). For a decision node, this linear
ordering can usually be exploited to decompose the ID into one
fraction prior to the node and another fraction posterior to the
node. The choice for the decision node can be made using the
fraction posterior to the node. The procedure repeats for each
decision node.

In this paper, we examine the ID evaluation problem for a
special class of IDs in which decision nodes have no parents.
Essentially, an ID with this property assumes no precedence
relationship among decision nodes. In other words, one has to
determine the choices for all decision nodes simultaneously.
For this reason, such an ID is said to be simultaneous. The
simultaneity assumption holds in many real-world problem domains.
For instance, a military planner must select among a number
of available actions to achieve his/her overall goal; a
business owner must consider multiple elements in order to
maximize his/her monetary profit.

In evaluating a simultaneous ID, we exploit the assumption
and divide the ID into two fractions, called the upstream
and the downstream. Roughly, the upstream consists of the decision
nodes and their child nodes, through which the decisions
propagate their impact on the ID. Informally, these child
nodes are called interface nodes. The downstream consists of
the interface nodes and their succeeding nodes. We present
a representation theorem showing that the expected value of
a value node under a policy can be represented as the sum
of some intermediate quantities weighted by probabilities
determined by the policy. These intermediate quantities involve
only the downstream. The factorization approach we propose
computes them once and uses them across all policies. The
computational gain brought by the approach depends on the
size of the downstream. Usually, a larger downstream
implies more savings.
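The idea behind the representation can be illustrated on a toy model. The sketch assumes two binary decision nodes and a single binary interface node, with made-up probabilities and values; the dictionary `p_interface` and the stub `downstream_value` are hypothetical names for this example. The expected value of a policy is then a sum over interface states x of P(x | policy) times a downstream quantity q[x] that does not depend on the policy.

```python
from itertools import product

# Upstream (assumed numbers): P(interface = 1 | d1, d2) for each policy.
p_interface = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9}

calls = {"downstream": 0}

def downstream_value(x):
    """Expected value given interface state x. In a real ID this would be an
    inference over the whole downstream; here it is a stub that also counts
    how many times it is invoked."""
    calls["downstream"] += 1
    return 100.0 if x == 1 else 20.0

# Intermediate quantities: computed once, shared by all policies.
q = {x: downstream_value(x) for x in (0, 1)}

def policy_value(policy):
    """Sum of downstream quantities weighted by policy-dependent probabilities."""
    p1 = p_interface[policy]
    return (1.0 - p1) * q[0] + p1 * q[1]

best = max(product((0, 1), repeat=2), key=policy_value)
print(best, policy_value(best), calls["downstream"])
```

All four policies are scored with only two downstream evaluations; without the factorization, each policy would trigger its own pass over the downstream, so the saving grows with the downstream size.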

We organize the paper as follows. In the next section, we
discuss work related to this research. We then introduce IDs
and the evaluation problem. In Section IV, we describe the
representation  theorem  and  develop  the  factorization  approach. 
In  Section  V,  we  discuss  two  extensions  of  the  approach:  how 
it  can  be  adapted  to  network  structure/parameter  changes  and 
how  it  can  be  used  in  planning  over  time.  We  report  empirical 
results  on  simulation  studies  and  a  military  planning  example 
in  Section  VI.  Finally,  we  conclude  the  paper  in  Section  VII. 

II.  Related  Work 

Since IDs were introduced by Howard and Matheson [1],
a variety of approaches have been proposed to find the
optimal policy of a given ID. To mitigate the exponential growth
of the number of policies in the number of decision nodes,
researchers have studied several special ID classes and
proposed efficient approaches exploiting their specific problem
characteristics. We give a brief survey of these IDs and their
solutions.

A.  Regular  and  No-Forgetting  IDs 

To some extent, most IDs that have been studied assume
a precedence ordering of the decision nodes. A regular ID
assumes that there is a directed path containing all decision
nodes; a no-forgetting ID assumes that each decision node and
its parents are also parents of the successive decision nodes; and
a stepwise decomposable ID assumes that the parents of each
decision node divide the ID into two separate fractions. These
assumptions are different from ours, which requires the actions
to be chosen simultaneously. There exist direct and indirect
approaches to evaluating a regular no-forgetting ID. A direct
approach works on the ID and evaluates it directly. Shachter [3]
proposed an algorithm that evaluates an ID by applying a series
of value-preserving reductions. A value-preserving reduction is
an operation that transforms an ID into another one with
the same expected value. Specifically, Shachter identified the
following four reductions: arc reversal, barren-node removal,
random-node removal, and decision-node removal. An indirect
approach first transforms an ID into an intermediate
structure whose optimal policy (or value) remains the same as in
the original ID. It then evaluates the intermediate structure
and obtains the optimal policy. For instance, Howard and
Matheson discussed a way to transform an ID into a
decision-tree network and to compute an optimal policy from the
decision tree. In transforming an ID into a decision-tree network, a
basic operation is arc reversal [1], [3]. Since a no-forgetting ID
must be stepwise decomposable, stepwise decomposability is
more general than no-forgetting.
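Arc reversal, the basic operation mentioned above, amounts to an application of Bayes' rule. The sketch below covers only the simplest case, an arc X -> Y between two discrete random nodes with no other parents, and uses made-up probabilities; the general operation, in which both nodes also inherit each other's parents, is not shown.

```python
def reverse_arc(p_x, p_y_given_x):
    """Reverse X -> Y into Y -> X for discrete X and Y with no other parents.

    p_x[x] = P(X = x); p_y_given_x[x][y] = P(Y = y | X = x).
    Returns (p_y, p_x_given_y) with p_x_given_y[y][x] = P(X = x | Y = y).
    """
    ys = list(next(iter(p_y_given_x.values())))
    # Marginalize X out to obtain the new root distribution P(Y).
    p_y = {y: sum(p_x[x] * p_y_given_x[x][y] for x in p_x) for y in ys}
    # Bayes' rule gives the new conditional P(X | Y).
    p_x_given_y = {
        y: {x: p_x[x] * p_y_given_x[x][y] / p_y[y] for x in p_x}
        for y in ys
        if p_y[y] > 0
    }
    return p_y, p_x_given_y

p_x = {0: 0.3, 1: 0.7}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_y, p_x_given_y = reverse_arc(p_x, p_y_given_x)
print(p_y, p_x_given_y)
```

The joint distribution is unchanged by the reversal, which is why the operation preserves the expected value of the diagram.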

In most ID evaluation approaches, the ordering of decision
nodes is an important source of information in decision making
and is therefore exploited to find the optimal decision
for each decision node [4]-[6]. A stepwise decomposable ID can
be evaluated by a divide-and-conquer approach. The approach
deals with one decision node at a time [7]. For each decision
node, its parental set separates an ID into two parts: a body and
a tail. The tail is a simple ID with only one decision node. The
body's value node is a new value node whose value function
is obtained by evaluating the tail. In evaluating a stepwise
decomposable ID, the approach begins with a leaf decision
node and repeats the decomposition/evaluation procedure for
the preceding decision nodes. In evaluating the tail, which has
a single decision node, the problem is reduced to that of
computing posterior probabilities in a Bayesian network.
Hence, the approach uses probabilistic inference techniques to
evaluate an ID. Cooper [8] initiated the research in this
direction. He gave a recursive formula for computing the maximal
expected utilities and optimal policies of IDs. Shachter and
Peot [9] showed that the problem of ID evaluation can be
reduced to a series of probabilistic inferences. Zhang [13]
described an algorithm that induces much easier probabilistic
inferences than those in [8] and [9].


B.  Partial  IDs 

There also exists research that relaxes the regularity or
no-forgetting assumption. The specific ID types include partial
IDs, unconstrained IDs, and limited memory IDs (LIMIDs), which
are a compact representation of IDs. A partial ID is an ID that
allows a non-total ordering of decision nodes [10]. Because
the solution to a partial ID depends on the temporal ordering
of the decisions, it is of interest to find the conditions
identifying a class of partial IDs whose solution is independent
of the legal evaluation ordering. Based on the concept of
d-connectivity, Nielsen and Jensen presented an algorithm
determining whether or not a partial ID represents well-defined
scenarios, and they also addressed the problem of whether all
admissible orderings yield the same optimal strategy.

An unconstrained ID is an ID where the order of decision
nodes and the observable random nodes is not determined
[11]. For an unconstrained ID, it is of interest to determine
the order of decision nodes and which set of nodes carries
the information necessary for decision making at a decision
node. For this purpose, a set of rules has been developed to
determine the choice of the next decision node, given the
current information. Such a choice may depend on the
specific information from the past.

Another recently proposed ID is the LIMID, which
violates the no-forgetting assumption [12]. In contrast to the
regularity and no-forgetting assumptions, the assumption
behind a LIMID is that only the information requisite for
computing optimal policies is depicted in the graphical
representation.
Two  properties  pertaining  to  LIMIDs  are:  1)  any  ID  can  be 
converted  to  a  LIMID;  and  2)  the  converted  LIMID  is  more 
compact  than  the  original  ID  in  the  sense  that  only  requisite 
information  is  depicted  in  the  LIMID  for  computing  an  optimal 
policy.  By  these  properties,  one  may  convert  an  ID  to  its 
LIMID  version  and  solve  the  LIMID  instead  of  the  original 
ID.  This  optimal  policy  is  also  optimal  in  the  original  ID.  The 
algorithm  solving  a  LIMID  exploits  the  fact  that  the  entire  de¬ 
cision  problem  can  be  partitioned  into  a  set  of  smaller  decision 
problems,  each  of  which  has  one  decision  node  only.  This  is 
analogous  to  the  divide-and-conquer  approach  [13]. 

C.  Simultaneous  IDs 

From  its  root  definition,  an  ID  does  not  impose  a  prece¬ 
dence  ordering  of  the  decision  nodes.  As  an  example,  there 
are  military  applications  that  need  to  choose  multiple  actions 
simultaneously.  A  simultaneous  ID  is  suitable  for  this  situation. 
We  exploit  this  assumption  and  divide  a  simultaneous  ID  into 
the  upstream  and  the  downstream  fractions.  The  decomposition 
takes  the  random  and  value  nodes  as  interface  nodes  between 
the  upstream  and  the  downstream.  The  computations  involv¬ 
ing  the  downstream  fraction  can  be  precomputed  and  reused 
across  all  policies  in  evaluating  them.  This  computation-sharing 
schema  can  greatly  accelerate  the  procedure  of  finding  the 
optimal  policy  for  a  given  ID,  as  indicated  in  our  theoretical 
and  empirical  analysis. 

Technically,  the  factorization  approach  has  some  conceptual 
similarities to the probabilistic inference-based algorithm [13].
Both algorithms divide the ID into two fractions. However,
there  are  apparent  differences.  In  [13],  the  separation  of  an  ID 
relies  on  a  single  decision  node.  With  respect  to  a  decision  node, 
roughly,  the  body  contains  the  predecessors  of  the  decision 
nodes,  while  the  tail  contains  the  successors.  The  choice  of  the 
decision  node  is  evaluated  by  the  tail  part.  This  is  quite  different 
from  our  factorization  approach,  where  the  separation  relies  on 
the  set  of  interface  nodes.  The  set  of  the  interface  nodes  sepa¬ 
rates  an  ID  into  two  fractions:  roughly,  the  upstream  contains 
the  predecessors  of  all  interface  nodes,  while  the  downstream 
contains  the  successors.  This  difference  in  solution  techniques 
stems  from  the  difference  in  assumption — the  probabilistic 
inference  algorithm  works  with  a  regular  ID  that  specifies  a 
linear  order  among  decision  nodes,  whereas  the  factorization 
algorithm  works  with  a  simultaneous  ID  that  assumes  no  order¬ 
ing  among  decision  nodes. 

III.  Influence  Diagram 

Mathematically, an ID $\mathcal{I}$ is a directed acyclic graph
consisting of three types of nodes and the links among these
nodes [1].

1) Its node set is partitioned into a set of random nodes
$\mathcal{Y}$, a set of decision nodes $\mathcal{X}$, and a set of
value nodes $\mathcal{U}$. A value node cannot have children.
The links characterize the conditional dependence among the
nodes in the ID. Specifically, links to a random node indicate
the probabilistic dependence of the node on its parents; links
to a decision node indicate the information available to the
planner at the time the planner must choose a decision for it;
and links to a value node indicate the functional
dependencies.

We will adopt the following notational conventions.
We will use bold-typed letters such as $\mathbf{Z}$ to denote a
set of variables and capital letters such as $Z$ to denote a
variable in the set. Each random or decision node $Z$ is
associated with a set $\Omega_Z$, denoting the set of its possible
states. The set $\Omega_Z$ is called the domain of node $Z$. An
element in $\Omega_Z$ is denoted by a lowercase letter $z$. For any
node $Z$, we use $\pi(Z)$ to denote its parent set. For any subset
$\mathbf{Z}' \subseteq \mathcal{Y} \cup \mathcal{X}$, we use
$\Omega_{\mathbf{Z}'}$ to denote the Cartesian product
$\prod_{Z \in \mathbf{Z}'} \Omega_Z$. For convenience, we shall use
the terms node and variable interchangeably. Without loss of
generality, we assume that all the nodes are binary
throughout this paper.

2) For each decision or random node $Z$, given an assignment
of $\pi(Z)$, the distribution $P(Z \mid \pi(Z))$ specifies the
probability of $Z$ being in each of its states. Such a
distribution is called a conditional probability table (CPT)
in the case that the domain of the variable $Z$ is a finite set.

3) For each value node $U$, $g_U$ is a value function
$g_U : \Omega_{\pi(U)} \to \mathbb{R}$, where $\mathbb{R}$ denotes the
set of the real numbers.

To avoid unnecessary notations, we define the (optimal)
policy concept only for a simultaneous ID.1 A policy, denoted
by $\delta$, specifies one action choice for each decision node in
$\mathcal{X}$. Hence, a policy $\delta$ can be denoted by
$(\delta_1, \ldots, \delta_n)$, where $\delta_i$ belongs to the
domain of $X_i$ for each $i$.

1 For general IDs, the definition of an (optimal) policy can be
found in, e.g., [6].


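Given a policy $\delta$, a probability $P_\delta$ can be defined over
the random nodes and decision nodes as follows:

$$P_\delta(\mathcal{Y}, \mathcal{X}) = \prod_{Y \in \mathcal{Y}} P(Y \mid \pi(Y)) \prod_{i=1}^{n} P_\delta(X_i) \tag{1}$$

where $P(Y \mid \pi(Y))$ is specified in the definition of
$\mathcal{I}$, while $P_\delta(X_i)$ is equal to 1.0 if
$X_i = \delta_i$, and 0.0 otherwise.

The expectation of the value node $U$ under policy $\delta$,
denoted by $E_\delta[U]$, is defined as

$$E_\delta[U] = \sum_{\pi(U)} P_\delta(\pi(U))\, g_U(\pi(U)). \tag{2}$$

The expected value $E_\delta$ of $\mathcal{I}$ under the policy
$\delta$ is the sum of $E_\delta[U]$ over all value nodes $U$ in
$\mathcal{U}$, i.e.,

$$E_\delta = \sum_{U \in \mathcal{U}} E_\delta[U]. \tag{3}$$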

For simplicity, $E_\delta$ is also called the expected value of
policy $\delta$. Evaluating a policy $\delta$ means computing its
expected value. The maximum of $E_\delta$ over all policies is the
optimal (expected) value of $\mathcal{I}$. An optimal policy is a
policy that achieves the optimal expected value. To evaluate
an ID is to find an optimal policy and to compute its optimal
expected value.

IV.  The  Factorization  Approach 

In  this  section,  we  describe  the  representation  theorem  and 
the  factorization  approach. 

A.  The  Idea 

From  its  definition,  an  ID  is  a  network  structure  consisting 
of  decision  nodes,  random  nodes,  and  value  nodes.  Among 
them,  in  determining  the  expected  value  of  the  ID,  a  decision 
node  plays  a  different  role  from  a  random  or  a  value  node.  The 
choices  of  a  decision  node  can  affect  the  expected  value  of  the 
ID  through  changing  the  CPTs  of  its  child  random  nodes,  or 
through  changing  the  value  functions  of  its  child  value  nodes 
(note  that  a  decision  node  cannot  have  another  decision  node 
as  child  in  a  simultaneous  ID).  In  this  sense,  a  node,  if  it  is  a 
child  of  a  decision  node,  serves  as  an  interface  through  which 
the  choices  of  decision  nodes  may  affect  the  value  of  the  ID. 
Such  a  node  is  called  an  interface  node.  All  interface  nodes 
constitute  an  interface  set.  Collectively,  an  interface  set  serves 
as  an  interface  of  an  ID  through  which  policies  can  affect  the 
expected  value  of  the  ID.  Consequently,  an  ID  can  be  divided 
into  two  fractions:  the  upstream  fraction,  which  includes  the 
interface  nodes  and  the  nodes  “preceding”  them,  and  the  down¬ 
stream  fraction,  which  includes  the  interface  nodes  and  the 
nodes  “succeeding”  the  interface  nodes. 

Example: We use the ID in Fig. 1 to informally illustrate
these concepts. The ID has two decision nodes $\{X_1, X_2\}$, five
random nodes $\{A, B, C, D, H\}$, and one value node $U$. The
interface set $\mathcal{Y}_{in}$ is $\{A, C\}$ since these nodes
have parental decision nodes. The upstream is
$\{X_1, X_2, A, B, C\}$, which consists of the two interface nodes
$A$ and $C$, node $X_1$ preceding node $A$, and nodes $B$ and
$X_2$ preceding node $C$. The downstream is $\{A, C, H, D, U\}$,
which consists of the interface nodes $A$ and $C$ and the nodes
$H$, $D$, and $U$ succeeding the interface nodes. ■

Fig. 1. ID to illustrate the representation theorem.

Interestingly, corresponding to the structural separation of
an ID into two fractions, the expected value of a value node
under a policy breaks into two fractions, each of which
involves only the upstream or the downstream of the ID.

B.  The  Theorem 

We  formalize  the  above  idea  in  this  section.  For  the  sake 
of  simplicity,  throughout  the  paper,  unless  explicitly  stated,  we 
assume  that:  1)  the  ID  has  only  one  value  node;  and  2)  the  value 
node  has  no  decision  node  as  its  parent.  We  also  note  that  our 
results  in  this  paper  generalize  to  the  IDs  with  multiple  value 
nodes  and  with  value  nodes  having  parental  decision  nodes.  We 
relax  these  assumptions  at  the  end  of  this  section. 

We begin by defining several concepts. A random node $Y$
is an interface node if its parent set has at least one decision
variable, i.e., $\pi(Y) \cap \mathcal{X} \neq \emptyset$. The
interface set of an ID is the set of all interface nodes. Due to
the above assumptions, the interface set contains only random
nodes; for this reason, we denote the set by $\mathcal{Y}_{in}$.
The upstream of the ID includes the interface set and all
ancestors of the nodes in the interface set. By this definition,
in addition to the interface random nodes and decision nodes,
the upstream may contain the random-node ancestors of the
interface nodes. These ancestral nodes must be included
because they, together with decision nodes, determine the
CPTs of the interface nodes. These ancestral random nodes
form a set denoted by $\mathcal{Y}_0$.

Given an ID, we can efficiently identify its upstream using
a queuing mechanism. We initialize a queue to be the interface
set $\mathcal{Y}_{in}$ (it can be readily built by checking, for
every node in the ID, whether it has a parental decision node)
and the upstream $\mathcal{I}_{up}$ to be empty. At each step, a
node is removed from the queue and added to $\mathcal{I}_{up}$ if
it is not already in $\mathcal{I}_{up}$. The parents of the node,
if not present in $\mathcal{I}_{up}$ thus far, are added to the
queue. The procedure terminates when the queue is empty.
When it terminates, the set $\mathcal{I}_{up}$ is the upstream
set. The procedure must terminate after a finite number of
steps because an ID is a directed acyclic graph.
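
The queuing procedure above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `find_upstream` and the parent-map representation are assumptions made here, and the parent sets for Fig. 1 are read off the conditional probabilities used in the example of this section.

```python
from collections import deque

def find_upstream(parents, decision_nodes):
    """Return (interface_set, upstream) for an ID given as a parent map.

    `parents` maps each node name to a list of its parents (hypothetical
    representation chosen for this sketch).
    """
    decisions = set(decision_nodes)
    # Interface nodes: nodes with at least one parental decision node.
    interface = {v for v, ps in parents.items() if decisions & set(ps)}
    upstream = set()
    queue = deque(sorted(interface))
    while queue:
        node = queue.popleft()
        if node in upstream:
            continue
        upstream.add(node)
        # Enqueue the parents that are not in the upstream so far.
        for p in parents.get(node, ()):
            if p not in upstream:
                queue.append(p)
    return interface, upstream

# Parent sets assumed for the ID of Fig. 1.
parents = {
    "X1": [], "X2": [],
    "A": ["X1", "X2"],
    "B": ["A"],
    "C": ["B", "X2"],
    "D": ["C"],
    "H": ["A", "D"],
    "U": ["H"],
}
interface, upstream = find_upstream(parents, ["X1", "X2"])
print(sorted(interface), sorted(upstream))
```

Under these assumed parent sets, the sketch recovers the interface set {A, C} and the upstream {X1, X2, A, B, C} stated in the example.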

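The upstream can be partitioned into three sets: the set
$\mathcal{X}$ of decision nodes, the interface set
$\mathcal{Y}_{in}$, and the set $\mathcal{Y}_0$ of random-node
ancestors of interface nodes. Given a policy $\delta$, we define
a function $f_\delta$ from $\Omega_{\mathcal{Y}_{in}}$ to the real
line $\mathbb{R}$. For notation, we let $m$ be the number of nodes
in the set $\mathcal{Y}_{in}$, $Y_{in}^{1:m}$ be a short notation
for $\{Y_{in}^{1}, \ldots, Y_{in}^{m}\}$, and $y_{in}^{1:m}$ be an
assignment to all interface variables, i.e., an element of
$\Omega_{Y_{in}^{1:m}}$:

$$f_\delta(y_{in}^{1:m}) = \sum_{\mathcal{Y}_0} \prod_{Y \in \mathcal{Y}_0 \cup \mathcal{Y}_{in}} P_\delta(Y \mid \pi(Y)) \prod_{i=1}^{n} P_\delta(X_i) \tag{4}$$

where $\prod_{Y \in \mathcal{Y}_0 \cup \mathcal{Y}_{in}} P_\delta(Y \mid \pi(Y)) \prod_{i=1}^{n} P_\delta(X_i)$
is the joint probability distribution of the variables in
$\mathcal{X}$, $\mathcal{Y}_0$, and $\mathcal{Y}_{in}$, given
policy $\delta$. Hence, $f_\delta(y_{in}^{1:m})$ is the conditional
probability that the interface assignment
$Y_{in}^{1:m} = y_{in}^{1:m}$ occurs under the policy $\delta$. For
convenience, we call these quantities interface probabilities.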

In contrast to the upstream, the downstream of an ID is the
set consisting of all the interface nodes and their descendants.
The downstream contains the value node, the interface nodes,
and the random nodes that do not belong to the upstream.
We use $\mathcal{Y}_1$ to denote the set of noninterface random
nodes in the downstream. Note that the random nodes in the
interface set belong to both the upstream and the downstream.
For an assignment of the set $Y_{in}^{1:m}$ and the value node
$U$, we can define a function as follows:

$$f_{in,U}(y_{in}^{1:m}) = \sum_{\mathcal{Y}_1} \prod_{Y \in \mathcal{Y}_1} P_\delta(Y \mid \pi(Y))\, g_U(\pi(U)). \tag{5}$$

To see that $f_{in,U}$ is a function of $Y_{in}^{1:m}$, we note that
the interface variables may appear in $\pi(Y)$ for
$Y \in \mathcal{Y}_1$. Given an assignment $y_{in}^{1:m}$ of
$Y_{in}^{1:m}$, $f_{in,U}(y_{in}^{1:m})$ is the expected utility
conditioned on the assigned interface $y_{in}^{1:m}$. These
quantities are called interface utilities for convenience. Since
$\pi(Y)$ for $Y$ in $\mathcal{Y}_1$ must belong to the downstream,
$P_\delta(Y \mid \pi(Y))$ is independent of policy $\delta$.
Consequently, these utilities are independent of policy $\delta$.

Theorem 1: Given a policy $\delta$ and a value node $U$, the
expected value of the node $U$ under policy $\delta$ is

$$E_\delta[U] = \sum_{y_{in}^{1:m} \in \Omega_{Y_{in}^{1:m}}} f_\delta(y_{in}^{1:m}) \cdot f_{in,U}(y_{in}^{1:m}). \tag{6}$$

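Proof: We show that $E_\delta[U]$ can be rewritten as the sum
of the interface utility $f_{in,U}$ weighted by the probability
$f_\delta$ over all interfaces

$$\begin{aligned}
E_\delta[U] &= \sum_{\pi(U)} P_\delta(\pi(U))\, g_U(\pi(U)) && \text{(a)} \\
&= \sum_{\pi(U)} \sum_{\mathcal{Y} \setminus \pi(U)} \prod_{Y \in \mathcal{Y}} P(Y \mid \pi(Y)) \prod_{i=1}^{n} P_\delta(X_i) \times g_U(\pi(U)) && \text{(b)} \\
&= \sum_{\mathcal{Y}_{in}} \sum_{\mathcal{Y}_0} \sum_{\mathcal{Y}_1} \prod_{Y \in \mathcal{Y}_0 \cup \mathcal{Y}_{in}} P(Y \mid \pi(Y)) \times \prod_{Y \in \mathcal{Y}_1} P(Y \mid \pi(Y)) \prod_{i=1}^{n} P_\delta(X_i)\, g_U(\pi(U)) && \text{(c)} \\
&= \sum_{\mathcal{Y}_{in}} \sum_{\mathcal{Y}_0} \prod_{Y \in \mathcal{Y}_0 \cup \mathcal{Y}_{in}} P(Y \mid \pi(Y)) \prod_{i=1}^{n} P_\delta(X_i) \times \Bigl[ \sum_{\mathcal{Y}_1} \prod_{Y \in \mathcal{Y}_1} P(Y \mid \pi(Y))\, g_U(\pi(U)) \Bigr] && \text{(d)} \\
&= \sum_{y_{in}^{1:m} \in \Omega_{Y_{in}^{1:m}}} f_\delta(y_{in}^{1:m}) \cdot f_{in,U}(y_{in}^{1:m}). && \text{(e)}
\end{aligned}$$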

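Step (a) is true by (2). At step (b), $\mathcal{Y} \setminus \pi(U)$
is the difference set of $\mathcal{Y}$ and $\pi(U)$. This step is
true by inserting (1) into (2). Step (c) follows from the fact
that $\{\pi(U), \mathcal{Y} \setminus \pi(U)\}$ and
$\{\mathcal{Y}_0, \mathcal{Y}_{in}, \mathcal{Y}_1\}$ are two
partitions of the set $\mathcal{Y}$. At step (d), we break the
distribution $\prod_{Y \in \mathcal{Y}} P(Y \mid \pi(Y)) \times \prod_{i=1}^{n} P_\delta(X_i)$
into two fractions:
$\prod_{Y \in \mathcal{Y}_0 \cup \mathcal{Y}_{in}} P(Y \mid \pi(Y)) \prod_{i=1}^{n} P_\delta(X_i)$
and $\prod_{Y \in \mathcal{Y}_1} P(Y \mid \pi(Y))\, g_U(\pi(U))$.
At step (e), we replace the two fractions with the definitions
of the interface probabilities and interface utilities. ■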

By the theorem, given a policy $\delta$ and a value node $U$, the
expected value of the node under the policy can be represented
as the sum of the products of the interface utilities and the
corresponding interface probabilities.

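Example (Continued): For the ID in Fig. 1, we show how to
represent $E_\delta[U]$ for the value node $U$ and a given policy
$\delta$. Let the policy $\delta$ be $(\delta_1, \delta_2)$, where
$\delta_i$ is the decision choice of $X_i$ for $i = 1, 2$. By
definition

$$E_\delta[U] = \sum_{H} \sum_{A,B,C,D} P(A \mid \delta_1 \delta_2) P(B \mid A) P(C \mid B \delta_2) \times P(D \mid C) P(H \mid A D) \prod_{i=1}^{2} P_\delta(X_i)\, g_U(H).$$

The two functions are defined as follows.

$$f_\delta(A, C) = \sum_{B} P(A \mid \delta_1 \delta_2) P(B \mid A) P(C \mid B \delta_2) \prod_{i=1}^{2} P_\delta(X_i)$$

$$f_{in,U}(A, C) = \sum_{H,D} P(H \mid A D) P(D \mid C)\, g_U(H).$$

It can be verified that
$E_\delta[U] = \sum_{A,C} f_\delta(A, C) \cdot f_{in,U}(A, C)$. ■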
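
The verification mentioned in the example can also be carried out numerically. The following Python sketch does so under stated assumptions: all nodes are binary, the CPTs are drawn at random, the value function `g_U` is an arbitrary illustrative choice, and the helper names (`random_cpt`, `pr`, `brute_force`, `factored`) are hypothetical. Because the indicator factors P_delta(X_i) equal 1 when X_i = delta_i, the decisions are substituted directly.

```python
import itertools
import random

random.seed(0)
B01 = (0, 1)

def random_cpt(n_parents):
    """Map each parent assignment to P(node = 1 | parents)."""
    return {pa: random.random()
            for pa in itertools.product(B01, repeat=n_parents)}

def pr(cpt, value, *pa):
    """P(node = value | parents = pa) for a binary node."""
    p1 = cpt[pa]
    return p1 if value == 1 else 1.0 - p1

# Random CPTs for the structure assumed from the example of Fig. 1.
pA, pB, pC = random_cpt(2), random_cpt(1), random_cpt(2)
pD, pH = random_cpt(1), random_cpt(2)
g_U = {0: -1.0, 1: 3.0}  # illustrative value function on H

def brute_force(d1, d2):
    """E_delta[U] by direct enumeration of all random nodes."""
    total = 0.0
    for a, b, c, d, h in itertools.product(B01, repeat=5):
        total += (pr(pA, a, d1, d2) * pr(pB, b, a) * pr(pC, c, b, d2)
                  * pr(pD, d, c) * pr(pH, h, a, d) * g_U[h])
    return total

def f_delta(a, c, d1, d2):
    """Interface probability: sums out the upstream node B."""
    return sum(pr(pA, a, d1, d2) * pr(pB, b, a) * pr(pC, c, b, d2)
               for b in B01)

def f_in_U(a, c):
    """Interface utility: sums out the downstream nodes H and D."""
    return sum(pr(pH, h, a, d) * pr(pD, d, c) * g_U[h]
               for h in B01 for d in B01)

def factored(d1, d2):
    """E_delta[U] as in Theorem 1, summing over the interface (A, C)."""
    return sum(f_delta(a, c, d1, d2) * f_in_U(a, c)
               for a in B01 for c in B01)

for d1, d2 in itertools.product(B01, repeat=2):
    print((d1, d2), brute_force(d1, d2), factored(d1, d2))
```

For every policy, the two printed values should agree up to floating-point rounding.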
We  examine  the  assumptions  we  made  at  the  beginning  of 
this  section.  First,  we  have  assumed  that  there  is  only  one 
value  node.  In  case  of  multiple  value  nodes,  we  may  apply  the 
representation  theorem  to  each  node.  The  expected  value  of  a 
policy  is  the  additive  sum  of  the  expected  values  of  all  value 
nodes  under  the  policy. 

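Second, we have assumed that the value node has no decision
nodes as its parents. In the case that the value node has a
decision node as a parent, the functions $f_{\delta,U}$ and
$f_{in}$ can be defined as follows, such that the theorem holds:

$$f_{\delta,U}(y_{in}^{1:m}) = \sum_{\mathcal{Y}_0} \prod_{Y \in \mathcal{Y}_0 \cup \mathcal{Y}_{in}} P_\delta(Y \mid \pi(Y)) \times \prod_{i=1}^{n} P_\delta(X_i)\, g_U(\pi(U))$$

$$f_{in}(y_{in}^{1:m}) = \sum_{\mathcal{Y}_1} \prod_{Y \in \mathcal{Y}_1} P_\delta(Y \mid \pi(Y)).$$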

TABLE I

Factorization Approach to ID Evaluation


1. Pre-compute the quantities $f_{in,U}(y_{in}^{1:m})$ for all $y_{in}^{1:m}$ in $\Omega_{Y_{in}^{1:m}}$

2. For each policy $\delta$

   2.1 compute $f_\delta(y_{in}^{1:m})$ for each assignment $y_{in}^{1:m}$

   2.2 compute $E_\delta[U]$ by Equation (6)

3. Return the policy that maximizes $E_\delta[U]$


Therefore, we can lift the assumption that the value node has
no decision node as a parent. In this case, $U$ is also called an
interface node, but it belongs to the upstream only. The reason
is that, by definition, a value node cannot have children and
therefore cannot have any impact on the downstream.2 Note
that $f_\delta$ changes to $f_{\delta,U}$, since the value node is
considered in computing the quantities relevant to the
upstream. Interestingly, it can be proven that $f_{in}$ is a
constant equal to 1.0. To see why, let us assume that the size
of $\mathcal{Y}_1$ is $k$. We enumerate the set $\mathcal{Y}_1$ as
$\{Y_1^{1}, \ldots, Y_1^{k}\}$ such that a node's parents appear
after the node in the set. In computing $f_{in}(y_{in}^{1:m})$, we
can sequentially sum out the variables in $\mathcal{Y}_1$ in the
enumerated order. Each such sum runs over a conditional
distribution whose variable no longer appears in any remaining
factor, so it equals 1.0. Ultimately, we have $f_{in} = 1.0$.

C.  The  Algorithm 

By  the  representation  theorem,  the  expected  value  of  a  policy 
is  represented  as  the  sum  of  interface  utilities  weighted  by  the 
corresponding  interface  probabilities.  The  interface  utilities  are 
independent  of  the  individual  policies,  whereas  the  interface 
probabilities  are  dependent  on  the  policies.  Therefore,  the  in¬ 
terface  utilities  can  be  factored  out,  i.e.,  they  can  be  calculated 
once  and  reused  across  all  the  policies. 

This  is  the  idea  behind  our  factorization  approach,  which 
is  described  in  Table  I.  The  factored-out  computations  are 
calculated  once  at  line  1.  They  are  used  for  all  policies  at  line 
2.2.  Note  that  the  procedure  generalizes  to  IDs  with  multiple 
value  nodes. 
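
As an illustration of the procedure in Table I, the following Python sketch computes the interface utilities once and reuses them for every policy. It is not the implementation used in the experiments: the function `evaluate_by_factorization`, the toy network (X1, X2 -> Y -> Z -> U), and all numbers are assumptions chosen only to keep the example small.

```python
import itertools

def evaluate_by_factorization(policies, interface_states, f_in_U, f_delta):
    """Return (best_policy, best_value) following the three steps of Table I."""
    # Step 1: interface utilities are computed once.
    utilities = {y: f_in_U(y) for y in interface_states}
    best_policy, best_value = None, float("-inf")
    # Step 2: only the interface probabilities depend on the policy.
    for policy in policies:
        value = sum(f_delta(policy, y) * utilities[y]
                    for y in interface_states)
        if value > best_value:
            best_policy, best_value = policy, value
    # Step 3: return the maximizing policy and its expected value.
    return best_policy, best_value

# Toy simultaneous ID with hypothetical numbers.
P_Y1 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9}  # P(Y=1 | x1, x2)
P_Z1 = {0: 0.2, 1: 0.7}                                        # P(Z=1 | y)
G_U = {0: 0.0, 1: 10.0}                                        # g_U(z)

def f_in_U(y):
    """Interface utility: expected value of U given the interface state y."""
    p1 = P_Z1[y]
    return (1.0 - p1) * G_U[0] + p1 * G_U[1]

def f_delta(policy, y):
    """Interface probability: P(Y = y) under the policy (x1, x2)."""
    p1 = P_Y1[policy]
    return p1 if y == 1 else 1.0 - p1

policies = list(itertools.product((0, 1), repeat=2))
best_policy, best_value = evaluate_by_factorization(
    policies, (0, 1), f_in_U, f_delta)
print(best_policy, best_value)
```

In this toy network, the policy (1, 1) maximizes the probability of the favorable interface state and is therefore selected.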

D. Complexity Analysis

It is of interest to compare the approach with the generic
brute-force approach that evaluates a policy directly by
combining (1) and (2). Let $n$ be the number of decision nodes.
Since all nodes are binary, the size of the policy space is
$2^n$. Let the complexity of evaluating one policy be $C$. The
complexity $C$ breaks into three pieces: computing $f_\delta$,
computing $f_{in,U}$, and computing $E_\delta[U]$ by (6). We denote
them, respectively, by $C_1$, $C_3$, and $C_2$. To evaluate all
policies, the generic approach has complexity $2^n C$, i.e.,
$2^n (C_1 + C_2 + C_3)$. In contrast, since the factorization
approach computes $f_{in,U}$ only once but uses it $2^n$ times,
its complexity is $2^n (C_1 + C_2) + C_3$. A good measure to
predict the computational gain is the size of the downstream,
i.e., the number of nodes in it. In one extreme, if the
downstream contains only the interface nodes and the value
nodes (thus, $C_3$ is a constant), the two approaches have the
same complexity. In
the other extreme, if the downstream contains far more nodes
than the upstream (i.e., $C_3 \gg C_1 + C_2$), the computational
gain is significant.

2 Note that this is different from random nodes having parental
decision nodes. Such a random node belongs to both the upstream
and the downstream.

Fig. 2. Dynamic ID model.


E. Bounding the Optimal Expected Value

We show that the interface utilities computed in the
factorization approach can be used to derive both an upper and
a lower bound of the optimal value of the ID. These bounds
have significant implications in practical planning.

We define $f_{in,U}^{+}$ and $f_{in,U}^{-}$ to be the largest and the
smallest among all interface utilities, i.e.,

$$f_{in,U}^{+} = \max_{y_{in}^{1:m} \in \Omega_{Y_{in}^{1:m}}} f_{in,U}(y_{in}^{1:m})$$

$$f_{in,U}^{-} = \min_{y_{in}^{1:m} \in \Omega_{Y_{in}^{1:m}}} f_{in,U}(y_{in}^{1:m})$$

where the max and min are taken over the domain of the
variables in $Y_{in}^{1:m}$. From the theorem, we see that
$f_{in,U}^{+}$ ($f_{in,U}^{-}$) is an upper (lower) bound of the
optimal value of the ID. Suppose, for instance, that these
bounds are available to a planner. In one extreme, if the
planner expects a utility that is larger than the upper bound,
there is no need to evaluate all the policies and find the
optimal one, because even the best policy provides less than
expected. In this case, the planner needs to redesign the
network structure or parameters such that the performance of
the ID can be improved. In the other extreme, if the planner
expects a utility that is less than the lower bound, again there
is no need to evaluate all the policies and choose the optimal
one, because any policy provides more than expected. In this
case, the planner can pick any policy and execute it.

We note that from the computational point of view,
computing these bounds is easier than evaluating the ID. There
are two reasons. First, as discussed earlier, computing these
bounds involves only the downstream of the ID, whereas
evaluating an ID involves its entire structure. If the
downstream contains far fewer variables than the upstream, the
interface utilities (and also the bounds) can be obtained
efficiently. Second, computing these bounds avoids enumerating
all the policies and calculating their expected values.

Finally, the tightness of the bounds depends on the structure
of an ID, the CPTs of random nodes, and the value functions of
value nodes. It is difficult to characterize a general condition
to determine the tightness of the bounds. In our experiments, we
empirically show that these bounds are reasonably tight for the
tested problems.
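
The bounds of this subsection follow from Theorem 1 in one step, which may be worth spelling out. The sketch below writes $y$ for an interface assignment $y_{in}^{1:m}$ and assumes that the interface probabilities $f_\delta(y)$ are nonnegative and sum to one over all interface assignments, as holds for a probability distribution over the interface; $f_{in,U}^{+}$ and $f_{in,U}^{-}$ denote the largest and smallest interface utilities.

```latex
\begin{align*}
E_\delta[U] &= \sum_{y} f_\delta(y)\, f_{in,U}(y)
  \;\le\; f_{in,U}^{+} \sum_{y} f_\delta(y) \;=\; f_{in,U}^{+}, \\
E_\delta[U] &= \sum_{y} f_\delta(y)\, f_{in,U}(y)
  \;\ge\; f_{in,U}^{-} \sum_{y} f_\delta(y) \;=\; f_{in,U}^{-}.
\end{align*}
```

Since both inequalities hold for every policy $\delta$, they hold in particular for an optimal policy, so the optimal value of the ID lies between $f_{in,U}^{-}$ and $f_{in,U}^{+}$.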


V.  Extensions  to  the  Factorization  Approach 

In  this  section,  we  discuss  two  extensions  to  the  factoriza¬ 
tion  approach.  These  extensions  deal  with  reconstructing  the 
policies  as  the  network  structure/parameters  undergo  changes. 
There are two perspectives. First, at one decision step, the
network might change, for example when more actions become
available for a planner's choice or more value nodes need to be
considered. Second, the network might dynamically alter its structure
or  parameters  as  time  goes  by.  For  example,  if  a  subgoal  is 
successfully  accomplished  at  one  step,  it  can  be  removed  from 
the  network  in  the  subsequent  steps. 

A.  Network  Structure/Parameter  Changes 

The  principle  for  the  factorization  approach  to  accommo¬ 
date  structure  or  parameter  changes  is  as  follows.  First,  if 
the  changes  involve  only  the  upstream  of  an  ID,  the  inter¬ 
face  utilities  do  not  need  to  be  recomputed  and  can  still  be 
shared  in  evaluating  the  ID.  Specifically,  these  changes  include 
addition  or  removal  of  decision/random/value  nodes  and  also 
the alteration of CPTs and value functions in the upstream.
Second,  if  the  changes  involve  only  the  downstream  of  the 
ID,  the  approach  needs  to  reconstruct  the  interface  utilities. 
Fortunately,  the  interface  probabilities  are  preserved  and  the 
calculations  for  them  can  be  saved.  Third,  if  the  changes  involve 
not  only  the  upstream  but  also  the  downstream,  the  approach 
needs  to  recompute  both  the  interface  utilities  and  the  interface 
probabilities. 

B.  Planning  Over  Time 

In  realistic  applications,  network  parameters  may  change 
over  time.  In  this  case,  we  can  use  a  dynamic  ID  to  model 
the  conditional  dependencies  among  nodes  over  time.  In  this 
section,  we  show  how  the  factorization  approach  can  be  used  to 
reconstruct  the  policies  on  a  step-by-step  basis  for  dynamic  IDs. 

To  facilitate  our  discussions,  we  extend  the  example  in  Fig.  1 
to  a  dynamic  ID.  We  assume  that  the  variable  H  evolves  over 
time  and  let  Ht  denote  H  at  step  t.  The  dynamic  ID  is  drawn  in 
Fig.  2.  In  contrast,  the  ID  in  Fig.  1  is  said  to  be  static  since  the 
multiple  decisions  are  made  at  one  time  step. 


The dynamic ID has two prominent features. First, at a single
step, the decision problem can be modeled as a static ID. In
addition to the nodes and links in Fig. 1, the node $H_{t+1}$ at
step $t + 1$ has one more parent node $H_t$. Second, the
intertemporal link between two consecutive nodes carries the
historic information about the sequence of performed policies.
For step $t + 1$, the information can be summarized by a
probability distribution of $H_t$ conditioned on the history [14].

For a dynamic ID, we are interested in optimal planning on
a step-by-step basis. The problem is formulated as follows:
given an initial probability distribution $P(H_0)$, at step
$t + 1$, how can we efficiently find the policy $\delta$
[$= (\delta_1, \ldots, \delta_n)$, where $n$ is the number of
decision nodes] that maximizes $E_\delta[U_{H_{t+1}}]$? To solve
the problem, we show: 1) how to select the optimal policy at
step $t + 1$, given the probability distribution $P(H_t)$; and
2) how to sequentially update the probability distribution
$P(H_{t+1})$ from $P(H_t)$, given a policy $\delta$ at the previous
step. After these two questions are settled, we may choose the
optimal policy as follows. At step $t + 1$, we first choose the
optimal policy for the step and then update the probability
$P(H_{t+1})$ from $P(H_t)$. The procedure repeats at each step.

To answer the first question, we introduce the concept of
an augmented interface node and an augmented interface set.
We call the node $H_t$ an augmented interface node of the ID at
step $t + 1$ since the node $H_t$ can affect the network through
changes in its probability distribution. In this sense, it is an
interface node.3 The augmented interface set consists of
$\mathcal{Y}_{in}$ as before and the node $H_t$. The downstream of
the ID at step $t + 1$ remains the same as that of the static ID.
Likewise, we may define the two functions $f_\delta$ and
$f_{in,U_{H_{t+1}}}$. Therefore, we can use the factorization
approach to solve the planning problem over time. The
computations involving $f_{in,U_{H_{t+1}}}$ are factored out. Note
that these interface utilities are shared for all policies at
each decision step. For the ID in Fig. 2, we can define the
following functions:

$$f_\delta(A, C, H_t) = P(H_t) \sum_{B} P(A \mid \delta_1 \delta_2) P(B \mid A) P(C \mid B \delta_2) \prod_{i=1}^{2} P_\delta(X_i)$$

$$f_{in,U_{H_{t+1}}}(A, C, H_t) = \sum_{H_{t+1},D} P(H_{t+1} \mid A D H_t) P(D \mid C)\, g_U(H_{t+1}).$$

It can be verified that
$E_\delta[U_{H_{t+1}}] = \sum_{A,C,H_t} f_\delta(A, C, H_t) \cdot f_{in,U_{H_{t+1}}}(A, C, H_t)$.

To answer the second question, we show how to efficiently
compute $P(H_{t+1})$, given a distribution $P(H_t)$ and the policy
$\delta$ performed at step $t$. We introduce a technique such that
the procedure of computing $P(H_{t+1})$ can be conducted similarly
to that of computing $E_\delta[U_{H_{t+1}}]$. Suppose that $H_{t+1}$
can take on two values, $h$ (true) and $\neg h$ (false). We first
show how to calculate the probability of $H_{t+1}$ being true. Let
$V$ be a value node that differs from $U_{H_{t+1}}$ only in its
value function. Specifically, $g_V$ is 1.0 if its parent $H_{t+1}$
is true; it is 0.0 otherwise. For simplicity, let $H_{t+1} = h$
($\neg h$) denote the event that the hypothesis $H_{t+1}$ is true
(false). We prove that $E_\delta[V] = P_\delta(H_{t+1} = h)$.

3 Previously, we defined an interface node to be a node that has
parental decision nodes since the choices of decision nodes can
affect its CPTs and, in turn, the expected value of the ID. In
contrast, the node $H_t$ is called an augmented interface node
since it can change its probability distribution and thus affect
the expected value of the ID.

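Proposition 1: $E_\delta[V] = P_\delta(H_{t+1} = h)$.

Proof:

$$\begin{aligned}
E_\delta[V] &= \sum_{H_{t+1}} P_\delta(H_{t+1})\, g_V(H_{t+1}) \\
&= P_\delta(H_{t+1} = h)\, g_V(H_{t+1} = h) + P_\delta(H_{t+1} = \neg h)\, g_V(H_{t+1} = \neg h) \\
&= P_\delta(H_{t+1} = h).
\end{aligned}$$

In the last step, we use the definition of the value function
$g_V$. ■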

To calculate the probability of $H_{t+1}$ being false, we may
define $g_V$ as follows: it is 1.0 if its parent $H_{t+1}$ is
false; it is 0.0 otherwise. If we define two functions $f_\delta$
and $f_{in,V}$, we see that the computational steps for
$E_\delta[V]$ are the same as those for computing
$E_\delta[U_{H_{t+1}}]$. Hence, computing $P(H_{t+1})$ does not
add much overhead to ID evaluation.
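
The two steps of planning over time, selecting a policy with the augmented interface (A, C, H_t) and updating the distribution of H through an indicator value function as in Proposition 1, can be sketched in Python as follows. This is a minimal sketch under assumptions: the network structure follows the functions given for Fig. 2, the CPTs are random, the value function `g_U` is arbitrary, and all helper names (`interface_utility`, `expected`, `step`) are hypothetical.

```python
import itertools
import random

random.seed(1)
B01 = (0, 1)

def random_cpt(n_parents):
    """Map each parent assignment to P(node = 1 | parents)."""
    return {pa: random.random()
            for pa in itertools.product(B01, repeat=n_parents)}

def pr(cpt, value, *pa):
    """P(node = value | parents = pa) for a binary node."""
    p1 = cpt[pa]
    return p1 if value == 1 else 1.0 - p1

pA, pB, pC = random_cpt(2), random_cpt(1), random_cpt(2)
pD, pH = random_cpt(1), random_cpt(3)  # H_{t+1} has parents (A, D, H_t)
g_U = {0: -2.0, 1: 5.0}                # illustrative value function
g_V = {0: 0.0, 1: 1.0}                 # indicator of H_{t+1} = true

def f_delta(a, c, h, policy, belief):
    """Interface probability of (A, C, H_t) under a policy and P(H_t)."""
    d1, d2 = policy
    return belief[h] * sum(
        pr(pA, a, d1, d2) * pr(pB, b, a) * pr(pC, c, b, d2) for b in B01)

def interface_utility(g):
    """Policy-independent utilities for every (A, C, H_t), given g."""
    return {(a, c, h): sum(pr(pH, h1, a, d, h) * pr(pD, d, c) * g[h1]
                           for h1 in B01 for d in B01)
            for a, c, h in itertools.product(B01, repeat=3)}

def expected(policy, belief, util):
    """Sum of interface probabilities times interface utilities."""
    return sum(f_delta(a, c, h, policy, belief) * u
               for (a, c, h), u in util.items())

def step(belief, util_U, util_V):
    """Pick the best policy for this step, then update the belief on H."""
    policies = list(itertools.product(B01, repeat=2))
    best = max(policies, key=lambda p: expected(p, belief, util_U))
    p_true = expected(best, belief, util_V)  # Proposition 1
    return best, {1: p_true, 0: 1.0 - p_true}

util_U = interface_utility(g_U)  # computed once, shared by all steps
util_V = interface_utility(g_V)
belief = {0: 0.5, 1: 0.5}        # assumed initial distribution P(H_0)
for t in range(3):
    policy, belief = step(belief, util_U, util_V)
    print(t + 1, policy, belief[1])
```

Both utility tables are built once before the loop, which reflects the point that the interface utilities are shared across policies and decision steps when the downstream does not change.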

It is interesting to compare the generic approach and the factorization approach in the context of the dynamic ID. Let the number of decision steps be $T$. Recall that the complexity of computing $f_\delta$ is $C_1$, the complexity of computing $f_{In,U_{H_{t+1}}}$ is $C_3$, and the complexity of computing $E_\delta[U_{H_{t+1}}]$ is $C_2$. Since $C_1$ takes constant time, it can be ignored. For one decision step, the factorization approach has the complexity $2^n C_2 + C_3$, while the generic approach has the complexity $2^n (C_2 + C_3)$. For $T$ steps, the complexity of the factorization approach is $2^n T C_2 + C_3$ [note that this does not include the overhead of computing $P(H_{t+1})$], while the complexity of the generic approach is $2^n T (C_2 + C_3)$. If $C_3 \gg C_2$, the factorization approach can be extremely efficient.
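The two cost models above can be compared directly. The Python sketch below simply encodes the stated expressions; the function names and the unit costs are illustrative assumptions, chosen so that the cost of computing the interface utilities dominates the per-policy cost.

```python
def factorization_cost(n, T, c2, c3):
    """Cost model from the text for T steps: 2^n * T * C2 + C3."""
    return (2 ** n) * T * c2 + c3


def generic_cost(n, T, c2, c3):
    """Cost model from the text for T steps: 2^n * T * (C2 + C3)."""
    return (2 ** n) * T * (c2 + c3)


# Assumed unit costs (illustrative only), with C3 much larger than C2.
n, T, c2, c3 = 9, 10, 0.01, 1.0

# Ratio of generic cost to factorization cost under these assumptions.
ratio = generic_cost(n, T, c2, c3) / factorization_cost(n, T, c2, c3)
```

Setting `T = 1` recovers the single-step expressions. As the ratio of C3 to C2 grows, the ratio of the two costs approaches 1 + C3/C2 for large 2^n T.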

VI.  Experiments 

In this section, we report experiments on both simulation studies and a military planning example. We wrote the code in Matlab 6.5 and ran it on a laptop with a 2.0-GHz central processing unit (CPU) under Windows XP. We compare the factorization approach against the generic brute-force approach. We chose the generic approach because we were not aware of specific algorithms for evaluating simultaneous IDs. For convenience, we refer to the two algorithms as evalCS (named after computation sharing) and evalBF (named after brute force).

A.  Simulation  Studies 

To thoroughly evaluate the performance of the factorization approach, we conducted simulation studies on the ID in the left chart of Fig. 3, which is similar to the military planning examples in [15]. It is referred to as the static ID in the rest of this section. The CPTs are randomly generated. The value functions for value nodes are manually specified.

Specifically, our experiments are designed to: 1) evaluate the performances of the factorization approach for static and dynamic IDs; 2) show the tightness of bounds derived from the interface utilities; 3) demonstrate how the computational gain achieved by the approach varies with different network structures; and 4) demonstrate the computational gain by adapting the approach to account for newly added decisions and value nodes.

Fig. 3. Test example is shown in the left chart, while the right is its variant for comparative studies.

Fig. 4. Performance comparison of evalBF and evalCS.

1) Performances of the Factorization Approach: To see how the performance of the algorithms varies with the number of decision nodes, we fix the number of random nodes at each level at four and vary the number of decision nodes. Thus, the static ID with n decision nodes additionally has ten random nodes and n + 1 value nodes. We ran evalBF and evalCS for seven problems with n = 3, 5, ..., 13. The timing data are presented in the left chart of Fig. 4. The chart gives the total CPU seconds that the algorithms took for each of the problems. Note that the vertical direction is drawn in log scale. The solid (dashed) curve is for evalCS (evalBF). It can be seen that evalCS is considerably more efficient than evalBF. For instance, from our data, for n = 9, to evaluate 512 policies, evalCS took 3.51 s while evalBF took 646.74 s; for n = 13, to evaluate 8192 policies, evalCS took 46.32 s while evalBF took 9284.05 s.

To quantitatively characterize how much savings the factorization procedure can bring about, we use the timing results of evalCS to predict the performance of evalBF. Recall that the complexity of evaluating a policy breaks into three fractions $C_1$, $C_3$, and $C_2$. We ignore $C_1$ since it is a constant. For each problem, we estimate $C_3$ by $\hat{C}_3$, the actual seconds spent computing $f_{In,U_H}$, and $C_2$ by $\hat{C}_2$, computed as (the total CPU time $- \hat{C}_3$)/(the number of policies). The complexity of evalBF is predicted by $2^n (\hat{C}_2 + \hat{C}_3)$. We found that these estimates are almost the same as the actual timing results of evalBF. This supports our complexity analysis.
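The prediction procedure described above can be sketched in a few lines of Python. The function name and the timings used below are hypothetical; they are not the measurements reported in the paper.

```python
def predict_evalbf_seconds(total_cs_seconds, c3_seconds, n):
    """Predict the evalBF time from evalCS timings as 2^n * (C2_hat + C3_hat).

    total_cs_seconds: total CPU time of evalCS for the problem
    c3_seconds:       time spent computing the interface utilities (C3_hat)
    n:                number of binary decision nodes, giving 2^n policies
    """
    num_policies = 2 ** n
    # C2_hat: per-policy cost once the interface utilities are available.
    c2_hat = (total_cs_seconds - c3_seconds) / num_policies
    return num_policies * (c2_hat + c3_seconds)


# Hypothetical timings (assumed): 4.0 s in total, of which 1.0 s is C3_hat.
predicted = predict_evalbf_seconds(4.0, 1.0, n=3)
```

The prediction reduces to (total time − C3_hat) + 2^n · C3_hat, which makes explicit that the brute-force approach pays the interface-utility cost once per policy.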

We  also  tested  the  algorithms  over  the  dynamic  ID  in  Fig.  5, 
which  is  an  extension  of  the  left  chart  of  Fig.  3. 

Initially, the probability of the node H being false is set to 1.0. Its probability is updated at each decision step. We ran both algorithms for up to ten decision steps. The right chart of Fig. 4 shows the total CPU time for the ID with nine decision nodes. The chart gives the total CPU seconds for both algorithms against the time steps. Note that the vertical direction is again drawn in log scale. It can be seen that the CPU time increases linearly with the number of time steps for evalBF, while its increase is negligible for evalCS. This is not surprising. In evalBF, all policies are evaluated at each step, and the time cost of every step is the same. Hence, the increase is linear. From our data, evalBF uses about 1300 s to evaluate all 512 ($2^9$) policies at each step. In evalCS, however, the interface utilities are computed only once, at the first step. We observe that at the first step, evalCS takes about 2.66 s to compute these utilities; thereafter, each step takes only about 10 s to evaluate all 512 policies. The increase is negligible when compared against that in evalBF.

Fig. 5. Dynamic influence diagram.

2) Tightness of Bounds: To show the tightness of the upper and lower bounds of the optimal expected value, in Fig. 6, we plot the optimal value (the middle curve) and these bounds (the upper and lower curves) for the static IDs with 3, 5, ..., 13 decision nodes. We see that these bounds are reasonably tight for the tested problems. For example, for n = 7, the optimal value is 871.47 while the bounds are 722.91 and 936.637. Although it is difficult to analyze the properties of these bounds quantitatively, these experiments show that they can be tight, at least for the tested problems.

3) Computational Gain Under Network-Structure Changes: To demonstrate how the computational gain of evalCS varies with different network structures, we run evalCS over the static ID and a modified version of it. The modified ID is obtained as follows: every link from $X_i$ to $Y^1_j$ is redirected to $Y^2_j$. The resulting ID is shown in the right chart of Fig. 3. Its upstream is $X \cup Y^1_{1:m_1} \cup Y^2_{1:m_2} \cup U_{1:n}$, whereas its downstream is $Y^3_{1:m_3} \cup \{H\} \cup \{U_H\}$, where $U_{1:n}$ denotes the set of value nodes and $Y^i_{1:m_i}$ denotes the set of nodes $Y^i_j$, i.e., $Y^i_{1:m_i} = \{Y^i_1, \ldots, Y^i_{m_i}\}$ for $i = 1, 2, 3$. Compared with that of the static ID, the downstream of the modified ID contains fewer random nodes. We expect that: 1) evalCS is still more efficient than evalBF on the modified ID, since its downstream contains a number of random nodes; and 2) evalCS achieves smaller savings on the modified ID than on the original ID.

Fig. 6. Lower and upper bounds obtained from the interface utilities.

The experiments presented in Fig. 7 confirm these expectations. First, the left chart plots the CPU seconds (in log scale) that evalCS and evalBF take for the modified IDs with different numbers of decision variables. It can be seen that evalCS is more efficient. Second, the right chart plots the magnitudes of the savings brought by evalCS. For each ID, the saving magnitude is measured by the quotient of the total time of evalBF and that of evalCS. The magnitudes are drawn in the vertical direction. For a modified ID, evalCS is about 14 times faster than evalBF. For the original IDs, the magnitudes vary with the number of their decision nodes. We see that the computational savings brought by evalCS are more significant for IDs whose downstream contains more nodes.

Fig. 7. Computational gains versus network structure.

Fig. 8. Replanning for added action/value nodes. In both charts, the curves from the top and bottom plot the total/replanning time for the generic approach, and the total/replanning time for the factorization approach.


4) Computational Gain Under Network-Parameter Changes: We also conducted experiments to show how the factorization approach achieves computational savings as the network changes. For this purpose, given the static ID, we first evaluate it (the planning phase), then add more nodes to the ID and reevaluate it (the replanning phase). We compare both the replanning time and the total time of the factorization approach against those of the generic approach.

In  one  experiment,  we  first  evaluate  a  static  ID  with  n 
decision  nodes.  We  then  add  two  decision  nodes  to  the  ID  and 
evaluate  the  modified  ID.  Every  newly  added  node  has  a  link 
from  itself  to  every  Y-  node.  The  timing  results  in  log  scale 
are presented in the left chart of Fig. 8. In the chart, the curve corresponding to CS1 (BF1) depicts the replanning time for the factorization (generic) approach, whereas the curve corresponding to CS (BF) depicts the corresponding total time. We see that for the tested problems, the factorization approach achieves considerable savings in replanning when more decision nodes are added. For instance, for the ID with nine decision nodes, the factorization and generic approaches take 9.80 and 2313.94 s, respectively. These savings are achieved by sharing the interface utilities computed during the evaluation of the original ID. Since the factorization approach takes much less time in both evaluating and reevaluating the ID, its total time is considerably less than that used by the generic approach.

In the other experiment, we evaluate the ID and then add one more value node for replanning. The added value node has a link from every node $Y^2_j$ for $j = 1, \ldots, 4$. In computing the expected value of the added value node, we still use the factorization approach. In reevaluating an ID, we do not recompute the expected value of the existing value node. The timing results in log scale are collected in the right chart of Fig. 8. The legends read similarly to those in the left chart. It can be seen that the factorization approach can achieve great savings in replanning. The reason is that the factorized computations are reused in computing the expected value of the newly added value node. By taking advantage of shared computations in evaluating the two value nodes, the total time used by the factorization approach is considerably less than that of the generic approach.

B.  A  Military  Planning  Example 

We applied the factorization approach to a hypothetical military planning example, which is illustrated in Fig. 9. The overall military goal is to win a war or to bring a tyrant to justice. The goal is represented by a Hypothesis node, which is at the top of the figure. There are 12 primitive actions, namely destroy_C2, destroy_Radars, ..., operate_special_force, which are at the bottom. Performing an action has direct effects with a specific purpose. For instance, if the action destroy_Radars is performed, the EW/GCIRadars is destroyed with a high probability. These effects alter the overall goal through altering the low-level subgoals. For instance, the status of C2 (command and control), EW/GCIRadars and Communications facilities, and Air_strike determine the workability of the integrated air defense system (IADS) and the strength of the enemy air force. In turn, the workability of the IADS and the strength of the enemy air force determine the loss of Air_superiority. The Air_superiority, Territory_occupation, and Commander_surrender are three subgoals determining the overall goal success. Without loss of generality, we assume all nodes are binary. In the example, each decision node is associated with a value node encoding the cost of performing the action, and the hypothesis node is associated with a value node encoding the utility of goal success. The optimal policy needs to balance the utility of goal success and the cost of performing actions.

Fig. 9. Static ID illustrating a military planning problem. Legend: IADS: integrated air defense system; C2: command and control; EW/GCI: early warning/ground control interception.

We designed a reasonable set of CPTs and value functions. For the Hypothesis node, if all the subgoals Air_superiority, Territory_occupation, and Command_surrender are achieved, the overall goal is successfully achieved. If one of the subgoals is not achieved, the probability of the overall success is decreased by 0.3; however, if none of the subgoals is achieved, the overall goal fails with certainty. Similarly, for the subgoal Air_superiority, the two influencing factors are IADS and Air_force. If the IADS system works well and Air_force is strong, Air_superiority is true for the enemy air force; if either the IADS system works poorly or Air_force is weak, the probability of Air_superiority being true is decreased by 0.5. Other CPTs for Territory_occupation and Command_surrender are set analogously to those for Air_superiority. A similar strategy is used in parameterizing the nodes IADS, Air_force, Artillery, Ground_force, Morale, and Commander_in_custody. In determining the CPTs for random nodes that are immediate children of the decision nodes, we assume that an action achieves its intended effect with probability 0.9. For example, a destroy_Radars decision will destroy the EW/GCI radars with probability 0.9. To complete the ID definition, we also assigned value functions. If the goal is successfully achieved, the reward is 1000; otherwise, the cost, i.e., a negative reward, is 500. For the decision nodes, if a ground attack is launched, the cost is 150; if the special force operation is performed, its cost is 100; if the commander decides to capture the bodyguards of the tyrant, the operating cost is 80; if an air strike is launched, the cost is 50; for any other action, the operating cost is 20.

IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, PART A: SYSTEMS AND HUMANS

Our primary interest is in the performance of the factorization algorithm. From our data, to evaluate the ID, the factorization algorithm took 45 s, while the brute-force algorithm took 9012 s, a speedup of roughly 200 times. We can explain the performance difference by the ID structure: its downstream contains a large number of nodes, namely all random nodes and the value node associated with the goal. Since its downstream contains far more random nodes than its upstream, the approach is expected to be significantly more efficient. Our secondary interest is in the optimal policy. The optimal policy is the one that performs only the air strike and the special force operation. The expected value of the ID is 561.98, and the probability of goal success is 0.81. We note that the optimal policy excludes “launching a ground attack,” although it is the action that is most likely to lead to goal success. One possible reason, as explained earlier, is the high operating cost of performing the action.
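The reported figures can be sanity-checked with a short Python sketch. It assumes that the value nodes combine additively, which is our reading of the setup rather than something stated explicitly, and the function name is hypothetical. The reward, failure cost, and action costs are the values quoted in the text.

```python
def expected_id_value(p_success, reward, failure_cost, action_costs):
    """Expected value under an assumed additive combination of value nodes:
    p * reward - (1 - p) * failure_cost - sum(action costs)."""
    return (p_success * reward
            - (1.0 - p_success) * failure_cost
            - sum(action_costs))


# Values quoted in the text: reward 1000, failure cost 500,
# air strike cost 50, special force operation cost 100,
# and a (rounded) success probability of 0.81.
value = expected_id_value(0.81, 1000.0, 500.0, [50.0, 100.0])
```

With the rounded probability 0.81, this gives roughly 565, close to the reported 561.98. Under the same additive assumption, a success probability of about 0.808 would reproduce the reported value, so the gap is consistent with rounding.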

VII.  Conclusion  and  Future  Work 

In this paper, we studied a special class of IDs, namely simultaneous IDs, where multiple decisions need to be made at one time step. We make two contributions. First, we examined a simultaneous ID and studied its theoretical properties. We showed that such an ID can be decomposed into an upstream fraction and a downstream fraction, and that the expected value of a value node under a policy can be represented as the sum of interface utilities that involve only the downstream fraction, weighted by the corresponding interface probabilities that involve only the upstream fraction. The interface utilities naturally provide upper and lower bounds on the optimal value of the ID. Second, we proposed a novel factorization algorithm to evaluate a simultaneous ID. The interface utilities are independent of the individual policies; therefore, they can be calculated once and reused across all policies. We also extended the factorization approach to a dynamic ID. The algorithm has been tested on simulation studies and a military planning example. Our experiments showed that the factorization algorithm is significantly more efficient than the generic algorithm in evaluating a simultaneous ID.
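The decomposition summarized above can be illustrated with a minimal Python sketch for a single value node. All names, interface states, and numbers are hypothetical. Taking the minimum and maximum interface utility as bounds is our reading of why the utilities bound the optimal value (a probability-weighted sum cannot leave that range); the paper's exact bound construction may differ.

```python
def policy_value(interface_probs, interface_utils):
    """Expected value of one value node under a policy: the sum of
    interface utilities weighted by the policy's interface probabilities."""
    return sum(p * interface_utils[s] for s, p in interface_probs.items())


def interface_bounds(interface_utils):
    """Bounds implied by a probability-weighted sum (an assumption here)."""
    return min(interface_utils.values()), max(interface_utils.values())


# Interface utilities: computed once, shared by every policy (assumed values).
interface_utils = {"s0": 100.0, "s1": 400.0, "s2": 900.0}

# Interface probabilities: these depend on the policy (assumed values).
policies = {
    "policy_a": {"s0": 0.5, "s1": 0.25, "s2": 0.25},
    "policy_b": {"s0": 0.125, "s1": 0.125, "s2": 0.75},
}

values = {name: policy_value(probs, interface_utils)
          for name, probs in policies.items()}
best = max(values, key=values.get)
lower, upper = interface_bounds(interface_utils)
```

Only the cheap weighted sum is repeated per policy, which is where the savings over the brute-force evaluation come from.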

To further speed up ID evaluation, one future direction is to combine the factorization approach with approaches for reducing the search space. In this paper, we address one difficulty in ID evaluation, i.e., evaluating individual policies. Another difficulty in ID evaluation is that the policy space contains exponentially many policies and one needs to evaluate all of them in order to find the optimal one. The ID evaluation process can be accelerated if the technique in this paper can be integrated with approaches for reducing the search space.

In an influence diagram (ID), value-of-information (VOI) is defined as the difference between the maximum expected utilities with and without knowing the outcome of an uncertainty variable prior to making a decision. It is widely used as a sensitivity analysis technique to rate the usefulness of various information sources, and to decide whether pieces of evidence are worth acquisition before actually using them. However, due to the exponential time complexity of exactly computing VOI of multiple information sources, decision analysts and expert-system designers focus on the myopic VOI, which assumes observing only one information source, even though several information sources are available. In this paper, we present an approximate algorithm to compute non-myopic VOI efficiently by utilizing the central-limit theorem. The proposed method overcomes several limitations in the existing work. In addition, a partitioning procedure based on the d-separation concept is proposed to further reduce the computational complexity of the proposed algorithm. Both the experiments with synthetic data and the experiments with real data from a real-world application demonstrate that the proposed algorithm can approximate the true non-myopic VOI well even with a small number of observations. The accuracy and efficiency of the algorithm make it feasible in various applications where efficiently evaluating a large number of information sources is necessary.

©  2008  Elsevier  Inc.  All  rights  reserved. 


1.  Introduction 

In a wide range of decision-making problems, a common scenario is that a decision maker must decide whether some information is worth collecting, and what information should be acquired first given several information sources available. Each set of information sources is usually evaluated by value-of-information (VOI). VOI is a quantitative measure of the value of knowing the outcome of the information source(s) prior to making a decision. In other words, it is quantified as the difference in value achievable with or without knowing the information sources in a decision-making problem.

Generally, VOI analysis is one of the most useful sensitivity analysis techniques for decision analysis [23,25]. VOI analysis evaluates the benefit of collecting additional information in a specific decision-making context [27]. General VOI analyses usually require three key elements: (1) a set of available actions and information collection strategies; (2) a model connecting the actions and the related uncertainty variables within the context of the decision; and (3) values for the decision outcomes. The methods of VOI analysis could be quite different when different models are used.

In this paper, we consider VOI analysis in decision problems modeled by influence diagrams. Influence diagrams were introduced by Howard and Matheson in 1981 [13] and have been widely used as a knowledge representation framework to facilitate decision making and probability inference under uncertainty. An ID uses a graphical representation to capture the three diverse sources of knowledge in decision making: conditional relationships about how events influence each other in the decision domain; informational relationships about what action sequences are feasible in any given set of circumstances; and functional relationships about how desirable the consequences are [21]. An ID can systematically model all the relevant random variables and decision variables in a compact graphical model.

In the past several years, a few methods have been proposed to compute VOI in IDs. Ezawa [8] introduces some basic concepts about VOI and evidence propagation in IDs. Dittmer and Jensen [7] present a method for calculating myopic VOI in IDs based on the strong junction tree framework [15]. Shachter [25] further improves this method by enhancing the strong junction tree as well as developing methods for reusing the original tree in order to perform multiple VOI calculations. Zhang et al. [28] present an algorithm to speed up the VOI computation by making use of the intermediate computation results, which are obtained when computing the optimal expected value of the original ID without the observations from the information sources. Instead of computing VOI directly, [22] describe a procedure to identify a partial order over variables in terms of their VOIs based on the topological relationships among variables in the ID. However, all these papers focus only on computing myopic VOI, which is based on two assumptions: (1) “No competition:” each information source is evaluated in isolation, as if it were the only source available for the entire decision; (2) “One-step horizon:” the decision maker will act immediately after consulting the source [21]. These assumptions result in a myopic policy: every time, the decision maker evaluates the VOI of each information source one by one, and chooses the one with the largest VOI. Then the observations are collected from the selected information source, the probabilities are updated, all the remaining information sources are reevaluated, and a similar procedure repeats.

Obviously, these assumptions are not always reasonable. Usually, the decision maker will not act after acquiring only one information source. Also, although a single information source may have low VOI and may not be worth acquiring compared to its cost, several information sources used together may have high VOI compared to their combined cost. In this case, by only evaluating myopic VOI, the conclusion will be not to collect such information, which is not optimal since its usage together with other information sources can lead to high value for the decision maker. Therefore, given these limitations in myopic VOI, it is necessary to compute non-myopic VOI.

Non-myopic VOI respects the fact that the decision maker may observe more than one piece of information before acting, and thus requires the consideration of any possible ordered sequence of observations given a set of information sources. Unfortunately, the number of sequences grows exponentially as the number of available information sources increases. It is therefore usually too cumbersome to compute non-myopic VOI for any practical use, which is why the aforementioned work focuses only on myopic VOI. Given these facts, an approximate computation of non-myopic VOI is necessary to make it feasible in practical applications. To the best of our knowledge, Heckerman et al. [11] are the only ones who have proposed a solution to this problem. In their approach, the central-limit theorem is applied to approximately compute non-myopic VOI in a special type of ID for the diagnosis problem, where only one decision node exists. Certain assumptions are required in their method: (1) all the random nodes and decision nodes in the ID are required to be binary; (2) the information sources are conditionally independent from each other given the hypothesis node, which is the node associated with the decision node and utility node.

Motivated by the method of Heckerman et al., we extend this method to more general cases: (1) all the random nodes can have multiple states and the decision node can have multiple rules (alternatives); (2) the information sources can be dependent given the hypothesis node; and (3) the ID can have a more general structure. However, as in Heckerman et al.’s method, we only discuss the VOI computation in terms of IDs that have only one decision node. This decision node shares only one utility node with another chance node. With the proposed algorithm, non-myopic VOI can be efficiently approximated. In order to validate the performance of the proposed algorithm, we not only perform experiments based on synthetic data for various types of IDs, but also provide a real-world application with real data.

Because of the efficiency and accuracy of the proposed method, we believe that it can be widely used to choose the optimal set of available information sources for a wide range of applications. No matter what selection strategies people use to choose an optimal set, such as greedy approaches, heuristic searching algorithms, or brute-force methods, the proposed method can be utilized to evaluate any information set efficiently in order to speed up the selection procedure.

The remaining sections are organized as follows. Section 2 presents a brief introduction to influence diagrams. The details of the algorithm are described in Section 3. Section 4 discusses the experimental results based on synthetic data. A real application is demonstrated in Section 5. Finally, Section 6 gives the conclusion and some suggestions for future work.

2.  Influence  diagrams 

An influence diagram (ID) is a graphical representation of a decision-making problem under uncertainty. Its knowledge representation can be viewed through three hierarchical levels, namely, relational, functional, and numerical. At the relational level, an ID represents the relationships between different variables through an acyclic directed graph consisting of various node types and directed arcs. The functional level specifies the interrelationships between various node types and defines the corresponding conditional probability distributions. Finally, the numerical level specifies the actual numbers associated with the probability distributions and utility values [6].

Specifically, an ID includes three types of nodes: decision, chance (random), and value (utility) nodes. Decision nodes, usually drawn as rectangles, indicate the decisions to be made and their set of possible alternative values. Chance nodes, usually drawn as circles/ellipses, represent uncertain variables that are relevant to the decision problem. They are similar to the nodes in Bayesian networks [14], and are associated with conditional probability tables (CPTs). Value nodes, usually drawn as diamonds, are associated with utility functions to represent the utility of each possible combination of the outcomes of the parent nodes. The arcs connecting different types of nodes have different meanings. An arc between two chance nodes represents probabilistic dependence, while an arc from a decision node to a chance node represents functional dependence, which means the actions associated with the decision node affect the outcome of the chance node. An arc between two decision nodes implies time precedence, while an arc from a chance node to a decision node is informational, i.e., it shows which variable will be known to the decision maker before a decision is made [21]. An arc pointing to a utility node represents value influence, which indicates that the parents of the utility node are those that directly affect its utility. Fig. 1 illustrates these arcs and gives corresponding interpretations.

Most IDs assume a precedence ordering of the decision nodes. A regular ID assumes that there is a directed path containing
all decision nodes; a no-forgetting ID assumes that each decision node and its parents are also parents of the successive
decision nodes; and a stepwise decomposable ID assumes that the parents of each decision node divide the ID into two separate
fractions. In this paper, we consider IDs that have only one decision node, i.e., ignoring all previous decisions. The goal
of ID modeling is to choose an optimal policy that maximizes the overall expected utility. A policy is a sequence of decision
rules where each rule corresponds to one decision node. Mathematically, if there is only one decision node in an ID and
assuming additive decomposition of the utility functions, the expected utility under a decision rule $d$ given any available
evidence $e$, denoted by $EU(d \mid e)$, can be defined as follows:

$$EU(d \mid e) = \sum_{i} \sum_{X_i} p(X_i \mid e, d)\, u_i(X_i, d), \qquad (1)$$

where $u_i$ is the utility function over the domain $X_i \cup \{D\}$. For example, $X_i$ could be the parents of the utility node that $u_i$ is
associated with. To evaluate an ID is to find an optimal policy as well as to compute its optimal expected utility [24,26]. More
detail about IDs can be found in [17,14].

Generally, the advantages of an ID can be summarized by its compact and intuitive formulation, its easy numerical assessment,
and its effective graphical representation of dependence between variables for modeling decision making under
uncertainty. These benefits have made the ID a widely used tool for modeling and solving complex decision problems in recent years.

3.  Approximate  VOI  computation 

3.1.  Value  of  information 

The VOI of a set of information sources is defined as the difference between the maximum expected utilities with and
without the information sources [17]. VOI can be used to rate the usefulness of various information sources and to decide
whether pieces of evidence are worth acquisition before actually using the information sources [21].

We discuss the VOI computation in terms of IDs that have only one decision node. This decision node shares only one
utility node with another chance node, as shown in Fig. 2, and the decision node and the chance node are assumed to be
independent. In the ID, the chance node $\Theta$, called the hypothesis node, represents a mutually exclusive and exhaustive
set of possible hypotheses $\theta_1, \theta_2, \ldots, \theta_h$; the decision node $D$ represents a set of possible alternatives $d_1, d_2, \ldots, d_q$; the utility
node $U$ represents the utility of the decision maker, which depends on the outcome of $\Theta$ and $D$; and the chance nodes
$O_1, \ldots, O_n$ represent possible observations from all kinds of information sources about the true state of $\Theta$. Each $O_i$
may have multiple states. Let $O = \{O_1, \ldots, O_n\}$; the VOI of $O$, $VOI(O)$, w.r.t. the decision node $D$, can be defined as follows:

$$VOI(O) = EU(O) - EU(\bar{O}), \qquad (2)$$

$$EU(O) = \sum_{o \in O} p(o) \max_{d_j \in D} \sum_{\theta_i \in \Theta} p(\theta_i \mid o)\, u(\theta_i, d_j), \qquad (3)$$

$$EU(\bar{O}) = \max_{d_j \in D} \sum_{\theta_i \in \Theta} p(\theta_i)\, u(\theta_i, d_j), \qquad (4)$$


[Figure 1 arc labels: probabilistic dependence, time precedence, functional dependence, informational, value influence.]

Fig. 1. Interpretations of arcs in an ID, where circles represent chance (random) nodes, rectangles represent decision nodes, and diamonds represent value (utility) nodes.

Fig. 2. An ID example for non-myopic VOI computation. $\Theta$ is the hypothesis node, $D$ is the decision node, and $U$ is the utility node. $O_i$ represents possible
observations from an information source. There could be hidden nodes between $\Theta$ and $O_i$.


where $u(\cdot)$ denotes the utility function associated with the utility node $U$, $EU(O)$ denotes the expected utility to the decision
maker if $O$ were observed, while $EU(\bar{O})$ denotes the expected utility to the decision maker without observing $O$. Here the cost
of collecting information from the information sources is not included; thus, the VOI can also be called perfect VOI [11]. The
net VOI is the difference between the perfect VOI and the cost of collecting information [12]. Since, after calculating the perfect
VOI, the computation of the net VOI is just a subtraction of cost, we focus on the perfect VOI in the subsequent sections.

As shown in Eq. (2), to compute $VOI(O)$, it is necessary to compute $EU(O)$ and $EU(\bar{O})$ respectively. Obviously, $EU(\bar{O})$ is easier
to compute, whereas directly computing $EU(O)$ could be cumbersome. If the decision maker has the option to observe a
subset of observations $\{O_1, \ldots, O_n\}$ and each $O_i$ has $m$ possible values, then there are $m^n$ possible instantiations of the
observations in this set. Thus, to compute $EU(O)$, there are $m^n$ inferences to be performed. In other words, the time complexity of
computing VOI is exponential in $n$. It becomes infeasible to compute $VOI(O)$ when $n$ is not small.
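
The exact computation in Eqs. (2)-(4) can be sketched as a brute-force enumeration. This is only an illustration under a simplifying assumption that is not part of the definition: the observation nodes are taken to be conditionally independent given the hypothesis node, so that $p(\theta_i, o)$ factorizes. The function name `exact_voi` and its argument layout are hypothetical.

```python
from itertools import product

def exact_voi(prior, likelihoods, utility):
    """Brute-force VOI for an ID with a single decision node.

    prior[i]             = p(theta_i)
    likelihoods[s][i][v] = p(O_s = v | theta_i); assumes the O_s are
                           conditionally independent given Theta
    utility[i][j]        = u(theta_i, d_j)
    """
    h, q = len(prior), len(utility[0])
    # Eq. (4): best expected utility without any observation.
    eu_without = max(sum(prior[i] * utility[i][j] for i in range(h))
                     for j in range(q))
    # Eq. (3): enumerate every instantiation o (exponential in n).
    eu_with = 0.0
    n_states = [len(lik[0]) for lik in likelihoods]
    for o in product(*(range(m) for m in n_states)):
        joint = []
        for i in range(h):
            p = prior[i]
            for s, v in enumerate(o):
                p *= likelihoods[s][i][v]
            joint.append(p)  # p(theta_i, o) = p(o) * p(theta_i | o)
        eu_with += max(sum(joint[i] * utility[i][j] for i in range(h))
                       for j in range(q))
    return eu_with - eu_without  # Eq. (2)
```

The loop over `product(...)` visits all $m^n$ instantiations, which is exactly the cost that the approximation in the following sections avoids.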

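The key to computing $VOI(O)$ efficiently is to compute $EU(O)$, which can be rewritten as follows:

$$EU(O) = \sum_{o \in O} p(o) \max_{d_j \in D} \sum_{\theta_i \in \Theta} p(\theta_i \mid o)\, u(\theta_i, d_j) = \sum_{o \in O} \max_{d_j \in D} \sum_{\theta_i \in \Theta} p(\theta_i \mid o)\, p(o)\, u(\theta_i, d_j) = \sum_{o \in O} \max_{d_j \in D} \sum_{\theta_i \in \Theta} p(\theta_i)\, p(o \mid \theta_i)\, u(\theta_i, d_j). \qquad (5)$$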

Note that each instantiation of $O$ corresponds to a specific optimal action for the decision node $D$. We define the decision
function $\delta : O \to D$, which maps an instantiation of $O$ into a decision in $D$. For example, $\delta(o) = d_k$ indicates that when the
observation is $o$, the corresponding optimal decision is $d_k$, i.e., $d_k = \arg\max_{d_j \in D} \sum_{\theta_i \in \Theta} p(\theta_i \mid o)\, u(\theta_i, d_j)$. Therefore we can divide all
the instantiations of $O$ into several subsets, where the optimal action is the same for those instantiations in the same subset.
Specifically, if $D$ has $q$ decision rules, $\{d_1, \ldots, d_q\}$, all the instantiations of $O$ can be divided into $q$ subsets, $o_{d_1}, o_{d_2}, \ldots, o_{d_q}$,
where $o_{d_k} = \{o \in O \mid \delta(o) = d_k\}$. Fig. 3 illustrates the relationships between each instantiation and the $q$ subsets. Thus, from
Eq. (5), $EU(O)$ can be further derived as follows:

$$EU(O) = \sum_{\theta_i \in \Theta} p(\theta_i) \sum_{k=1}^{q} u(\theta_i, d_k) \sum_{o \in o_{d_k}} p(o \mid \theta_i). \qquad (6)$$

In  the  next  several  sections,  we  show  how  to  compute  EU(O)  efficiently. 
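
For reference, the step from Eq. (5) to Eq. (6) can be spelled out as below. This is a sketch of the intermediate reasoning using only the definitions given above: the maximization over $d_j$ is resolved by the decision function $\delta$, and the sums are then reordered.

```latex
\begin{aligned}
EU(O) &= \sum_{o \in O} \max_{d_j \in D} \sum_{\theta_i \in \Theta}
         p(\theta_i)\, p(o \mid \theta_i)\, u(\theta_i, d_j) \\
      &= \sum_{k=1}^{q} \sum_{o \in o_{d_k}} \sum_{\theta_i \in \Theta}
         p(\theta_i)\, p(o \mid \theta_i)\, u(\theta_i, d_k)
         && \text{(since } \delta(o) = d_k \text{ for } o \in o_{d_k}\text{)} \\
      &= \sum_{\theta_i \in \Theta} p(\theta_i) \sum_{k=1}^{q} u(\theta_i, d_k)
         \sum_{o \in o_{d_k}} p(o \mid \theta_i).
\end{aligned}
```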

3.2.  Decision  boundaries 

In Eq. (6), the difficult part is to compute $\sum_{o \in o_{d_k}} p(o \mid \theta_i)$ because the size of the set $o_{d_k}$ could be very large based on the
previous analysis. In order to compute it efficiently, it is necessary to know how to divide all the instantiations of $O$ into
the $q$ subsets. We first focus on the case that $\Theta$ has only two states, $\theta_1$ and $\theta_2$, and then extend it to the general case in Section
3.4.

Based on the definition, the expected utility of taking the action $d_k$ is $EU(d_k) = p(\theta_1)\, u_{1k} + p(\theta_2)\, u_{2k}$, where
$u_{1k} = u(\theta_1, d_k)$ and $u_{2k} = u(\theta_2, d_k)$. We can sort the indices of all the decision rules based on the utility functions, such
that $u_{1k} > u_{1j}$ and $u_{2k} < u_{2j}$ for $k < j$. Fig. 4 gives an example of the utility function $u(\Theta, D)$. As shown in the figure, as $k$
increases, $u_{1k}$ decreases and $u_{2k}$ increases. If there is an action $d_i$ that cannot be sorted according to this criterion, it is either
dominated by another action, or it dominates another action. (If $u(d_i, \Theta)$ is always larger than $u(d_j, \Theta)$, no matter what the state
of $\Theta$ is, we say $d_i$ dominates $d_j$.) Then the dominated action can be removed from the set of possible actions, without changing
the optimal policy.

Fig. 3. (a) Each $o_i$ corresponds to an instantiation; (b) all the instantiations can be divided into $q$ subsets, where each instantiation in the set $o_{d_i}$ corresponds
to the optimal decision $d_i$.

Fig. 4. An example of the utility function $u(\Theta, D)$.

Proposition 1. Let $r_{jk} = \dfrac{u_{2j} - u_{2k}}{u_{1k} - u_{1j} + u_{2j} - u_{2k}}$, $p^*_{kl} = \max_{k < j \le q} r_{jk}$, and $p^*_{ku} = \min_{1 \le j < k} r_{jk}$; then $d_k$ is the optimal action if and only if
$p^*_{kl} \le p(\theta_1) \le p^*_{ku}$. In addition, $p^*_{ql} = 0$ and $p^*_{1u} = 1$. (Here $k$ is the index of an action.)

Proof. See Appendix. □

Proposition 1 states that if the probability of $\Theta$ being $\theta_1$ is between $p^*_{kl}$ and $p^*_{ku}$, $d_k$ is the optimal decision. From this, we
can further derive Proposition 2.
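
The boundaries in Proposition 1 can be computed directly from the sorted utility values. The following is a minimal sketch, assuming the threshold $r_{jk}$ is the value of $p(\theta_1)$ at which $EU(d_j) = EU(d_k)$ (obtained by setting $p(\theta_2) = 1 - p(\theta_1)$); the function name `decision_boundaries` and the 0-based indexing are illustrative choices.

```python
def decision_boundaries(u1, u2):
    """Return [(p_kl, p_ku)] for each action d_k (0-based k).

    u1[k] = u(theta_1, d_k), u2[k] = u(theta_2, d_k). The actions are
    assumed sorted so that u1 is strictly decreasing and u2 strictly
    increasing in k (dominated actions already removed).
    """
    q = len(u1)

    def r(j, k):
        # Value of p(theta_1) at which EU(d_j) equals EU(d_k).
        return (u2[j] - u2[k]) / (u1[k] - u1[j] + u2[j] - u2[k])

    bounds = []
    for k in range(q):
        # Lower bound from actions after d_k; 0 if there are none.
        lower = max((r(j, k) for j in range(k + 1, q)), default=0.0)
        # Upper bound from actions before d_k; 1 if there are none.
        upper = min((r(j, k) for j in range(k)), default=1.0)
        bounds.append((lower, upper))
    return bounds
```

For example, with `u1 = [10, 6, 0]` and `u2 = [0, 6, 10]` the middle action is optimal for $p(\theta_1)$ between 0.4 and 0.6. If an action is never optimal, its interval comes out empty (lower bound above upper bound).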

Proposition 2.

$$\sum_{o \in o_{d_k}} p(o) = p\left(p^*_{kl} \le p(\theta_1 \mid o) \le p^*_{ku}\right). \qquad (7)$$

Proof. See Appendix. □

The proof of Proposition 2 establishes Eq. (7) by showing that both sides of this equation express the probability that $d_k$ is
the optimal decision for $O$. Based on Proposition 2, we can get the following corollary.

Corollary 1.

$$\sum_{o \in o_{d_k}} p(o \mid \theta_1) = p\left(p^*_{kl} \le p(\theta_1 \mid o) \le p^*_{ku} \mid \theta_1\right), \qquad (8)$$

$$\sum_{o \in o_{d_k}} p(o \mid \theta_2) = p\left(p^*_{kl} \le p(\theta_1 \mid o) \le p^*_{ku} \mid \theta_2\right). \qquad (9)$$

The equations in Corollary 1 give the probability that the decision maker will take the optimal decision $d_k$ after
observing new evidence, given that the state of $\Theta$ is $\theta_i$ before collecting the evidence.

Based on Corollary 1, the problem of computing $\sum_{o \in o_{d_k}} p(o \mid \theta_i)$, $i = 1, 2$ (from Eq. (6)), is transformed into the problem of computing
$p(p^*_{kl} \le p(\theta_1 \mid o) \le p^*_{ku} \mid \theta_i)$, which is the topic of the next section. We will focus on $p(p^*_{kl} \le p(\theta_1 \mid o) \le p^*_{ku} \mid \theta_1)$ only because the
procedure of computing $p(p^*_{kl} \le p(\theta_1 \mid o) \le p^*_{ku} \mid \theta_2)$ is similar.

3.3. Approximation with central-limit theorem


3.3.1.  A  partitioning  procedure 

To compute $p(p^*_{kl} \le p(\theta_1 \mid o) \le p^*_{ku} \mid \theta_1)$, one way is to treat $p(\theta_1 \mid o)$ as a random variable. If the probability density function of
this variable is known, it will be easy to compute $p(p^*_{kl} \le p(\theta_1 \mid o) \le p^*_{ku} \mid \theta_1)$. However, it is hard to get such a probability density
function directly. But we notice that $p(\theta_2 \mid o) = 1 - p(\theta_1 \mid o)$ and that $x/(1-x)$ is increasing in $x$. Based on the transformation
property between a random variable and its function [2], it is straightforward that

$$p\left(p^*_{kl} \le p(\theta_1 \mid o) \le p^*_{ku} \mid \theta_1\right) = p\left(\frac{p^*_{kl}}{1 - p^*_{kl}} \le \frac{p(\theta_1 \mid O)}{p(\theta_2 \mid O)} \le \frac{p^*_{ku}}{1 - p^*_{ku}} \;\Big|\; \theta_1\right).$$

Let us take a closer look at $p(\theta_1 \mid O)/p(\theta_2 \mid O)$ because it is critical in the approximate algorithm.
If all the $O_i$ nodes are conditionally independent from each other given $\Theta$, based on the chain rule:

$$\frac{p(\theta_1 \mid O)}{p(\theta_2 \mid O)} = \frac{p(O_1 \mid \theta_1) \cdots p(O_n \mid \theta_1)\, p(\theta_1)}{p(O_1 \mid \theta_2) \cdots p(O_n \mid \theta_2)\, p(\theta_2)}. \qquad (10)$$

Usually some $O_i$'s may not be conditionally independent given $\Theta$. We will show that $p(\theta_1 \mid O)/p(\theta_2 \mid O)$ is approximately distributed as a
log-normal random variable. However, in order to prove it, it is necessary to obtain a form similar to Eq. (10) even when the $O_i$'s
are not conditionally independent. We thus propose a partitioning procedure to partition $O$ into several groups based on the
principle of d-separation [21], where the nodes in one group are conditionally independent from the nodes in other groups.
This procedure consists of three steps.


(1) Decide whether two nodes, $O_i$ and $O_j$, are conditionally independent given $\Theta$ by exploring the ID structure based on four
rules: (i) if there is a directed path between $O_i$ and $O_j$ without passing $\Theta$, $O_i$ and $O_j$ are dependent; (ii) if both $O_i$ and $O_j$
are ancestors of $\Theta$, $O_i$ and $O_j$ are dependent given $\Theta$; (iii) after removing the links to and from $\Theta$ from the original
ID, if $O_i$ and $O_j$ have common ancestors, or $O_i$ is $O_j$'s ancestor, or vice versa, then $O_i$ and $O_j$ are dependent; and (iv) in all
the other cases, $O_i$ and $O_j$ are conditionally independent given $\Theta$.

(2) Build an undirected graph to model the relationships between the nodes. In such a graph, each vertex represents an $O_i$
node, and each edge between two vertices indicates that the two corresponding nodes are dependent according to the
rules in Step 1.

(3)  Partition  the  graph  into  disjoint  connected  subgraphs.  A  depth  first  search  (DFS)  algorithm  [4]  is  used  to  partition  the 
graph  into  several  connected  components  (disjoint  connected  subgraphs)  so  that  each  component  is  disconnected  from 
other  components.  The  nodes  in  each  connected  component  are  conditionally  independent  from  the  nodes  in  any 
other  connected  components.  Therefore,  each  connected  component  corresponds  to  one  group. 

For example, for the ID in Fig. 5a, with the partitioning procedure, the $O_i$ nodes can be divided into five groups, $\{O_1, O_2\}$,
$\{O_3, O_4, O_5\}$, $\{O_6\}$, $\{O_7\}$, and $\{O_8, O_9\}$. Fig. 5b shows the graph built by the partitioning procedure.
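
Steps 2 and 3 of the partitioning procedure amount to finding the connected components of an undirected graph. The sketch below assumes the pairwise dependence tests of Step 1 have already been carried out and are supplied as a list of dependent pairs; the function name `partition_observations` is hypothetical, and the edge list in the usage example is invented purely so that the result matches the five groups quoted above (the actual edges of Fig. 5b are not reproduced here).

```python
def partition_observations(nodes, dependent_pairs):
    """Group nodes into connected components of the dependency graph."""
    adjacency = {v: set() for v in nodes}
    for a, b in dependent_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, groups = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:  # iterative depth-first search
            v = stack.pop()
            component.append(v)
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        groups.append(sorted(component))
    return groups

# Hypothetical edges chosen to reproduce the grouping in the text.
nodes = ["O1", "O2", "O3", "O4", "O5", "O6", "O7", "O8", "O9"]
pairs = [("O1", "O2"), ("O3", "O4"), ("O4", "O5"), ("O8", "O9")]
groups = partition_observations(nodes, pairs)
```

Each node and each edge is visited a constant number of times, so the cost is linear in the size of the dependency graph.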


3.3.2.  Central-limit  theorem 

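Generally, with the partitioning procedure presented in the previous subsection, $O$ can be automatically divided into several
sets, named $O^{s_1}, O^{s_2}, \ldots, O^{s_g}$, where $g$ is the overall number of groups. Thus, Eq. (10) can be modified as follows:

$$\frac{p(\theta_1 \mid O)}{p(\theta_2 \mid O)} = \frac{p(O^{s_1} \mid \theta_1)}{p(O^{s_1} \mid \theta_2)} \cdots \frac{p(O^{s_g} \mid \theta_1)}{p(O^{s_g} \mid \theta_2)} \cdot \frac{p(\theta_1)}{p(\theta_2)}$$

$$\Rightarrow \ln \frac{p(\theta_1 \mid O)}{p(\theta_2 \mid O)} = \sum_{i=1}^{g} \ln \frac{p(O^{s_i} \mid \theta_1)}{p(O^{s_i} \mid \theta_2)} + \ln \frac{p(\theta_1)}{p(\theta_2)}$$

$$\Rightarrow \ln \phi = \sum_{i=1}^{g} w_i + c, \qquad (11)$$

where

$$\phi = \frac{p(\theta_1 \mid O)}{p(\theta_2 \mid O)}, \qquad w_i = \ln \frac{p(O^{s_i} \mid \theta_1)}{p(O^{s_i} \mid \theta_2)}, \qquad c = \ln \frac{p(\theta_1)}{p(\theta_2)}.$$

In the above equation, $c$ can be regarded as a constant reflecting the state of $\Theta$ before any new observation is obtained and
any new decision is taken. Here, we assume $p(\theta_2 \mid O)$, $p(O^{s_i} \mid \theta_2)$, and $p(\theta_2)$ are not equal to 0.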

Let $W = \sum_{i=1}^{g} w_i$ be the sum of the $w_i$. Following [11], we use the central-limit theorem to approximate $W$. The central-limit
theorem [9] states that the sum of independent variables approaches a Gaussian distribution when the number of variables
becomes large. Also, the expectation and variance of the sum are the sums of the expectations and variances of the individual
random variables. Thus, regarding each $w_i$ as an independent variable, $W$ approximately follows a Gaussian distribution. Then, based on
Eq. (11), $\phi$ approximately follows a log-normal distribution. For a random variable $X$, if $\ln(X)$ has a Gaussian distribution, we say $X$ has a log-normal
distribution. The probability density function is $p(x) = \frac{1}{S x \sqrt{2\pi}} \exp\left(-\frac{(\ln x - M)^2}{2 S^2}\right)$, denoted as $X \sim \mathrm{LogN}(M, S^2)$ [5], where $M$
and $S$ are the mean and standard deviation of the variable's logarithm [1]. In order to assess the parameters (mean and variance)
of the log-normal distribution, we need to compute the mean and the variance of each $w_i$. The computational process
is shown as follows.

Fig. 5. (a) An ID example; (b) the graph built by the partitioning procedure.

Assume $O^{s_i}$ has $r_i$ instantiations, $\{o^{s_i}_1, \ldots, o^{s_i}_{r_i}\}$, where $r_i$ is the product of the numbers of states of the nodes in the group
$O^{s_i}$; e.g., if $O^{s_i} = \{O_1, O_2\}$ and both $O_1$ and $O_2$ have three states, then $r_i = 3 \times 3 = 9$. Table 1 gives the values and the probability
distribution of each $w_i$.

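Based on the table, the expected value $\mu$ and the variance $\sigma^2$ of each $w_i$ can be computed as follows:

$$\mu(w_i \mid \theta_1) = \sum_{j=1}^{r_i} p(o^{s_i}_j \mid \theta_1) \ln \frac{p(o^{s_i}_j \mid \theta_1)}{p(o^{s_i}_j \mid \theta_2)}, \qquad (12)$$

$$\sigma^2(w_i \mid \theta_1) = \sum_{j=1}^{r_i} p(o^{s_i}_j \mid \theta_1) \ln^2 \frac{p(o^{s_i}_j \mid \theta_1)}{p(o^{s_i}_j \mid \theta_2)} - \mu^2(w_i \mid \theta_1). \qquad (13)$$

Treating the $w_i$ as independent, as required by the central-limit theorem, the expected value and the variance of $W$ can be obtained by the following equations:

$$\mu(W \mid \theta_1) = \sum_{i=1}^{g} \mu(w_i \mid \theta_1), \qquad (14)$$

$$\sigma^2(W \mid \theta_1) = \sum_{i=1}^{g} \sigma^2(w_i \mid \theta_1). \qquad (15)$$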

Therefore, based on Eq. (11), for $W \sim N(\mu(W \mid \theta_1), \sigma^2(W \mid \theta_1))$, we have $\phi \sim \mathrm{LogN}(\mu(W \mid \theta_1) + c, \sigma^2(W \mid \theta_1))$, where LogN denotes
the log-normal distribution. After getting the probability distribution function and the function parameters for $\phi$ in Eq. (11),
we are ready to assess the non-myopic VOI.

Before we go to the next section, we first analyze the computational steps involved in computing the parameters for the
log-normal distribution, which is the most time-consuming part of the algorithm. Based on Eqs. (12) and (14), the overall
number of computational steps is $4 \sum_{i=1}^{g} r_i + 2g$. We will show that this number is much smaller than the overall number
of computational steps in the exact computational method during the algorithm analysis in Section 3.5.
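
The parameter computation for the two-state case can be sketched as follows. This is an illustrative implementation of Eqs. (12)-(15) under the stated assumption that all conditional probabilities are strictly positive; the function names `w_moments` and `lognormal_params`, and the representation of each group as a pair of probability lists over its joint instantiations, are choices made for this example.

```python
import math

def w_moments(p_given_t1, p_given_t2):
    """Mean and variance of w_i under theta_1 (Eqs. (12)-(13)).

    p_given_t1[j] = p(o_j | theta_1) and p_given_t2[j] = p(o_j | theta_2)
    over the r_i joint instantiations of one group; entries must be > 0.
    """
    logs = [math.log(a / b) for a, b in zip(p_given_t1, p_given_t2)]
    mean = sum(a * l for a, l in zip(p_given_t1, logs))
    second = sum(a * l * l for a, l in zip(p_given_t1, logs))
    return mean, second - mean * mean

def lognormal_params(groups, prior_t1, prior_t2):
    """Return (M, S^2) of phi given theta_1: M = mu(W) + c, S^2 = var(W).

    groups is a list of (p_given_t1, p_given_t2) pairs, one per group.
    """
    moments = [w_moments(a, b) for a, b in groups]
    mu_w = sum(m for m, _ in moments)    # Eq. (14)
    var_w = sum(v for _, v in moments)   # Eq. (15)
    c = math.log(prior_t1 / prior_t2)    # constant c in Eq. (11)
    return mu_w + c, var_w
```

Each group contributes one pass over its $r_i$ instantiations, so the work grows with $\sum_i r_i$ rather than with the product of the $r_i$.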


3.3.3.  Approximate  non-myopic  value-of-information 

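Based on Proposition 1 in Section 3.2, we know that $d_k$ is the optimal action with the probability $p(p^*_{kl} \le p(\theta_1 \mid o) \le p^*_{ku})$,
which is equivalent to $p\left(\frac{p^*_{kl}}{1 - p^*_{kl}} \le \phi \le \frac{p^*_{ku}}{1 - p^*_{ku}}\right)$, as shown in Section 3.3.1. Let $\phi^*_{kl} = \frac{p^*_{kl}}{1 - p^*_{kl}}$ and $\phi^*_{ku} = \frac{p^*_{ku}}{1 - p^*_{ku}}$; thus, $d_k$ is the optimal
decision if and only if $\phi^*_{kl} \le \phi \le \phi^*_{ku}$. Then, based on Corollary 1 in Section 3.2, the following equation holds:

$$\sum_{o \in o_{d_k}} p(o \mid \theta_1) = p\left(\phi^*_{kl} \le \phi \le \phi^*_{ku} \mid \theta_1\right). \qquad (16)$$

Furthermore, from Section 3.3.2, we know that $\phi \sim \mathrm{LogN}(\mu(W \mid \theta_1) + c, \sigma^2(W \mid \theta_1))$; thus,

$$p\left(\phi^*_{kl} \le \phi \le \phi^*_{ku} \mid \theta_1\right) = \int_{\phi^*_{kl}}^{\phi^*_{ku}} \frac{1}{\sigma(W \mid \theta_1)\, x \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu(W \mid \theta_1) - c)^2}{2 \sigma^2(W \mid \theta_1)}\right) dx. \qquad (17)$$

$p(\phi^*_{kl} \le \phi \le \phi^*_{ku} \mid \theta_2)$ can be computed in the same way by replacing $\theta_1$ with $\theta_2$ in the previous equations.

Therefore, VOI can be approximated by combining Eqs. (2), (6), (16), and (17). Fig. 6 shows the key equations of the algorithm
when $\Theta$ has only two states. In summary, to approximate $VOI(O)$ efficiently, the key is to compute $EU(O)$, which leads
to an approximation of $\sum_{o \in o_{d_k}} p(o \mid \theta_i)$ with the log-normal distribution by exploiting the central-limit theorem and the decision
boundaries.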
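
The integral in Eq. (17) has a closed form in terms of the error function, because the log-normal cumulative distribution is the Gaussian cumulative distribution evaluated at $\ln x$. The sketch below uses that standard identity; the function names are hypothetical, `m` and `s` stand for the mean and standard deviation of $\ln \phi$ (that is, $\mu(W \mid \theta_1) + c$ and $\sigma(W \mid \theta_1)$), and `s` is assumed to be strictly positive.

```python
import math

def lognormal_cdf(x, m, s):
    """P(X <= x) when ln X ~ N(m, s^2), with s > 0."""
    if x <= 0.0:
        return 0.0
    if math.isinf(x):
        return 1.0
    z = (math.log(x) - m) / (s * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

def prob_action_optimal(p_kl, p_ku, m, s):
    """Approximate p(p_kl <= p(theta_1 | o) <= p_ku | theta_1), Eq. (17)."""
    def odds(p):
        # Map a probability boundary to the corresponding boundary on phi.
        return math.inf if p >= 1.0 else p / (1.0 - p)
    return lognormal_cdf(odds(p_ku), m, s) - lognormal_cdf(odds(p_kl), m, s)
```

When the intervals $[p^*_{kl}, p^*_{ku}]$ of all actions tile $[0, 1]$, the resulting probabilities sum to one, which gives a quick sanity check on an implementation.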


3.4.  Generalization 


In the previous algorithm, the node $\Theta$ allows only two states, although the other random nodes and the decision node can
have multiple states. However, in real-world applications, $\Theta$ may have more than two states. In this section, we extend the
algorithm to the case that $\Theta$ can have several states too. Assume $\Theta$ has $h$ states, $\theta_1, \ldots, \theta_h$, and $D$ still has $q$ rules,
$d_1, \ldots, d_q$. Similarly to Eq. (11), we have the following equations:

Table 1
The probability distribution of $w_i$

[Figure 6 panel labels: decision boundaries; central-limit theorem.]

Fig. 6. The key equations to approximate VOI when $\Theta$ has only two states, $D$ has multiple rules, and the other nodes have multiple states.


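$$\frac{p(\theta_i \mid O)}{p(\theta_h \mid O)} = \frac{p(O^{s_1} \mid \theta_i)}{p(O^{s_1} \mid \theta_h)} \cdots \frac{p(O^{s_g} \mid \theta_i)}{p(O^{s_g} \mid \theta_h)} \cdot \frac{p(\theta_i)}{p(\theta_h)}, \qquad i \ne h$$

$$\Rightarrow \ln \frac{p(\theta_i \mid O)}{p(\theta_h \mid O)} = \sum_{k=1}^{g} \ln \frac{p(O^{s_k} \mid \theta_i)}{p(O^{s_k} \mid \theta_h)} + \ln \frac{p(\theta_i)}{p(\theta_h)}$$

$$\Rightarrow \ln \phi_i = \sum_{k=1}^{g} w^i_k + c_i, \qquad (18)$$

where

$$\phi_i = \frac{p(\theta_i \mid O)}{p(\theta_h \mid O)}, \qquad w^i_k = \ln \frac{p(O^{s_k} \mid \theta_i)}{p(O^{s_k} \mid \theta_h)}, \qquad c_i = \ln \frac{p(\theta_i)}{p(\theta_h)}.$$

Let $W_i = \sum_{k=1}^{g} w^i_k$, $i \ne h$; $W_i$ still has an approximately Gaussian distribution. Here, we assume $p(\theta_h \mid O)$, $p(O^{s_k} \mid \theta_h)$, and $p(\theta_h)$ are not equal to 0.
A method similar to that of Section 3.3 can be used to compute the variance and the mean. Specifically, for the newly defined $w^i_k$ in
the above equation, Table 1 can be modified as follows (see Table 2).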

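Thus, we get the following equations:

$$\mu(w^i_k \mid \theta_j) = \sum_{l=1}^{r_k} p(o^{s_k}_l \mid \theta_j) \ln \frac{p(o^{s_k}_l \mid \theta_i)}{p(o^{s_k}_l \mid \theta_h)}, \qquad 1 \le i < h,\; 1 \le j \le h,\; 1 \le k \le g, \qquad (19)$$

$$\sigma^2(w^i_k \mid \theta_j) = \sum_{l=1}^{r_k} p(o^{s_k}_l \mid \theta_j) \ln^2 \frac{p(o^{s_k}_l \mid \theta_i)}{p(o^{s_k}_l \mid \theta_h)} - \mu^2(w^i_k \mid \theta_j). \qquad (20)$$

Similar to Eq. (14), the expected value and the variance of $W_i$ can be obtained as follows:

$$\mu(W_i \mid \theta_j) = \sum_{k=1}^{g} \mu(w^i_k \mid \theta_j), \qquad 1 \le i < h,\; 1 \le j \le h, \qquad (21)$$

$$\sigma^2(W_i \mid \theta_j) = \sum_{k=1}^{g} \sigma^2(w^i_k \mid \theta_j). \qquad (22)$$

Accordingly, $\phi_i$ follows the log-normal distribution with $S_{ij} = \sigma(W_i \mid \theta_j)$ and $M_{ij} = \mu(W_i \mid \theta_j) + c_i$. We denote the probability
density function of $\phi_i$ given $\theta_j$ as $f_{\theta_j}(\phi_i)$. Eqs. (19) and (21) show that the overall number of computational steps to assess
the parameters for the log-normal distributions is $4h \sum_{k=1}^{g} r_k + 2h(h-1)g$ when $h > 2$.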

Even though $f_{\theta_j}(\phi_i)$ can be easily obtained, it is still necessary to get the decision boundaries for each optimal decision in
order to efficiently compute $\sum_{o \in o_{d_k}} p(o \mid \theta_j)$. Therefore, a set of linear inequalities needs to be solved when $\Theta$ has more
than two states. For example, if $d_k$ is the optimal action, $EU(d_k)$ must be at least as large as the expected utility of taking any other
action. Based on this, a set of linear inequalities can be obtained:

$$p(\theta_1) u_{1k} + p(\theta_2) u_{2k} + \cdots + p(\theta_h) u_{hk} \ge p(\theta_1) u_{1j} + \cdots + p(\theta_h) u_{hj}$$

$$\Rightarrow \frac{u_{1k} - u_{1j} + u_{hj} - u_{hk}}{u_{hj} - u_{hk}}\, p(\theta_1) + \cdots + \frac{u_{(h-1)k} - u_{(h-1)j} + u_{hj} - u_{hk}}{u_{hj} - u_{hk}}\, p(\theta_{h-1}) \ge 1$$

$$\Rightarrow \frac{u_{1k} - u_{1j}}{u_{hj} - u_{hk}} \cdot \frac{p(\theta_1)}{p(\theta_h)} + \cdots + \frac{u_{(h-1)k} - u_{(h-1)j}}{u_{hj} - u_{hk}} \cdot \frac{p(\theta_{h-1})}{p(\theta_h)} \ge 1. \qquad (23)$$

We assume $u_{hj} - u_{hk} > 0$; otherwise, $\ge$ is changed to $\le$ in the last inequality.

Let $A_k$ be the solution region of the above linear inequalities; then

$$\sum_{o \in o_{d_k}} p(o \mid \theta_j) = \int_{A_k} f_{\theta_j}(\phi_1) \cdots f_{\theta_j}(\phi_{h-1})\, dA_k, \qquad 1 \le k \le q. \qquad (24)$$

Table 2
The probability distribution of $w^i_k$


The right side of Eq. (24) is an integral over the solution region $A_k$ decided by the linear inequalities. We first demonstrate
how to solve the integral when $\Theta$ has three states, and then introduce the method for the case that $\Theta$ has more than three
states.

When $\Theta$ has three states, Eq. (23) can be simplified as follows:

$$p(\theta_1) u_{1k} + p(\theta_2) u_{2k} + p(\theta_3) u_{3k} \ge p(\theta_1) u_{1j} + p(\theta_2) u_{2j} + p(\theta_3) u_{3j} \;\Rightarrow\; \alpha_{1kj} \frac{p(\theta_1)}{p(\theta_3)} + \alpha_{2kj} \frac{p(\theta_2)}{p(\theta_3)} \ge 1,$$

$$\text{where } \alpha_{1kj} = \frac{u_{1k} - u_{1j}}{u_{3j} - u_{3k}} \text{ and } \alpha_{2kj} = \frac{u_{2k} - u_{2j}}{u_{3j} - u_{3k}}. \qquad (25)$$

In the above, it is assumed that $u_{3j} > u_{3k}$; if $u_{3j} < u_{3k}$, then $\ge$ is changed to $\le$ in the last inequality.

And Eq. (24) can be simplified as follows:

$$\sum_{o \in o_{d_k}} p(o \mid \theta_j) = \int_{A_k} f_{\theta_j}(\phi_1)\, f_{\theta_j}(\phi_2)\, dA_k, \qquad 1 \le k \le q,\; 1 \le j \le 3. \qquad (26)$$

$A_k$ is decided by $(q - 1)$ linear inequalities and each inequality has two variables, $\phi_1$ and $\phi_2$, as defined in Eq. (25). We use the
following steps to solve this integral when $A_k$ is a finite region.


1. Identify all the lines that define the inequalities and find all the intersection points between any two lines as well as the
intersection points between any line and the x (or y) axis.

2. Choose the intersection points that satisfy all the linear inequalities, and use them as vertices to form a polygon.

3. Divide the polygon into several simple regions. Specifically, for each vertex, we generate a line crossing this vertex and
parallel to the y-axis. The lines then divide the polygon into several simple regions.

4.  Evaluate  the  integral  in  each  simple  region  and  sum  the  values  together. 


An example of the solution region is shown in Fig. 7. In this example, if $\alpha_{1kj} > \alpha_{1ki}$ ($i \ne j$), then $\alpha_{2kj} > \alpha_{2ki}$ too. Therefore, the
solution region can be decided by the intersection points of the lines that are defined by the linear inequalities and the axes.
For example, in Fig. 7, $A_k$ is decided by the points a-d, which are selected from the intersection points $\{(1/\alpha_{1kj}, 0),
(0, 1/\alpha_{2kj}),\; j = 1, \ldots, q,\; j \ne k\}$. Based on [3], the time complexity of solving $m$ linear inequalities with $n$ variables (each
inequality only has two variables) is $O(mn \log m + mn^2 \log^2 n)$. In this case, $n$ is 2 and $m$ is $q - 1$.

Fig. 7. A solution region of a group of linear inequalities.

When $\Theta$ has more than three states, the integral needs to be performed in a high-dimensional space (dimension larger
than 2). Therefore, we solve it with Quasi-Monte Carlo integration [10,16], which is a popular method to handle multiple
integrals. Quasi-Monte Carlo integration picks points based on sequences of quasirandom numbers over some simple domain
$A'_k$ which is a superset of $A_k$, checks whether each point is within $A_k$, and estimates the area (n-dimensional content) of $A_k$ as
the area of $A'_k$ multiplied by the fraction of points falling within $A_k$. Such a method is implemented by Mathematica [20],
which can automatically handle a multiple integral with a region implicitly defined by multiple inequality functions.
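
The hit-or-miss estimate described above can be sketched with a hand-written Halton sequence. This is an illustration only and not the Mathematica routine used in the paper: the function names are hypothetical, the bounding box plays the role of the simple superset domain, and the example estimates the content of a region rather than the density-weighted integral of Eq. (24), which would additionally weight each accepted point by the product of the log-normal densities.

```python
def halton(index, base):
    """index-th element (index >= 1) of the van der Corput sequence."""
    result, f = 0.0, 1.0 / base
    while index > 0:
        result += f * (index % base)
        index //= base
        f /= base
    return result

def qmc_region_measure(inside, box, n_points=20000):
    """Estimate the measure of {x in box : inside(x)}.

    box is a list of (low, high) pairs, one per dimension (at most six
    dimensions with the primes listed here); inside is a predicate that
    takes a point (list of floats) and returns True or False.
    """
    primes = [2, 3, 5, 7, 11, 13]
    volume = 1.0
    for low, high in box:
        volume *= high - low
    hits = 0
    for i in range(1, n_points + 1):
        point = [low + (high - low) * halton(i, primes[d])
                 for d, (low, high) in enumerate(box)]
        hits += inside(point)
    return volume * hits / n_points

# Region cut out of the unit square by one linear inequality.
estimate = qmc_region_measure(lambda x: x[0] + x[1] <= 1.0,
                              [(0.0, 1.0), (0.0, 1.0)])
```

A region defined by several linear inequalities is handled by letting `inside` test all of them at once.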

Fig. 8 shows the key equations of the algorithm when $\Theta$ has multiple states. The main equations are similar to those in
Fig. 6. However, since $\Theta$ has multiple states, it becomes more complex to obtain the parameters of the log-normal distribution
and perform the integration.

3.5.  Algorithm  analysis 

Now, we analyze the computational complexity of the proposed approximation algorithm compared to the exact computational
method. For simplicity, assume that the number of states of each $O_i$ node is $m$, and there are $n$ nodes in the set $O$.
Assume we only count the time used for computing expected utilities. Then the computational complexity of the exact VOI
computational method is approximately $h m^n$, where $h$ is the number of states of the $\Theta$ node. With the approximation
algorithm, the computational complexity is reduced to $h m^k$, where $h$ is the number of states of the $\Theta$ node, and $k$ is
the number of $O_i$ nodes in the largest group among $\{O^{s_1}, \ldots, O^{s_g}\}$. In the best case, if all the $O_i$ nodes are conditionally
independent given $\Theta$, the time complexity is about linear with respect to $m$. In the worst case, if all the $O_i$ nodes are dependent,
the time complexity is approximately $m^n$. However, in most real-world applications, $k$ is less than $n$; thus, the
approximate algorithm is expected to be more efficient than the exact computational method, as will be shown in the experiments.
For example, for the ID in Fig. 5, $n = 9$, $m = 4$, $h = 3$, and $q = 3$. Then, for the exact computation, the number of computations
is around $3 \times 4^9 = 786{,}432$, while using the approximate algorithm, the number of computations is only around
$3 \times 4^3 = 192$.
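
The two counts in this example can be reproduced with a few lines. The helper names below are hypothetical, and the value $k = 3$ is taken from the largest group in the partition of Fig. 5 quoted in Section 3.3.1.

```python
def exact_cost(h, m, n):
    """Approximate number of expected-utility computations, exact method."""
    return h * m ** n

def approx_cost(h, m, k):
    """Approximate count for the proposed method; k = largest group size."""
    return h * m ** k

exact = exact_cost(h=3, m=4, n=9)     # 3 * 4**9
approx = approx_cost(h=3, m=4, k=3)   # 3 * 4**3
```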

However, in addition to the cost of computing expected utilities, the approximation algorithm also includes some extra
costs: sorting the utility functions (Section 3.2), partitioning the $O$ set (Section 3.3.1), and deciding the decision boundaries
(Section 3.2) when $\Theta$ has two states, or performing the integral when $\Theta$ has more than two states (Section 3.4). These costs
are not included in the above analysis. In general, the extra time in these steps is much less than the time used for computing
expected utilities. For example, the time complexity of sorting is $O(q \log q)$, the time complexity of the partitioning procedure
is $O(|V| + |E|)$ ($V$ is the set of vertices, and $E$ is the set of edges in an ID), and the time complexity of deciding the decision
boundaries when $\Theta$ has two states is $O(q^2)$. When $\Theta$ has more than two states, deciding the decision boundaries needs additional
time. Empirically, it does not affect the overall speed, as will be shown in the experiments. In addition, most steps in
computing expected utilities involve performing inferences in an ID, which is usually NP-hard and thus consumes much
more time than a step in the procedures of sorting, partitioning, and integrating.


[Figure 8 panel labels: decision region; central-limit theorem; $\phi_i \sim$ log-normal distribution.]

Fig. 8. The key equations to compute VOI when $\Theta$ has multiple states.

Table 3
ID structures

k                5    4    3    2    1
Number of IDs    2    3    3    1    1

k is the size of the biggest group after partitioning.

Table 4
Testing cases

ID_indep: 2-state    50 test cases, where the O_i nodes are conditionally independent given Θ, which has two states
ID_indep: 3-state    50 test cases, where the O_i nodes are conditionally independent given Θ, which has three states
ID_indep: 4-state    50 test cases, where the O_i nodes are conditionally independent given Θ, which has four states
ID_dep: 2-state      450 test cases, where the O_i nodes are conditionally dependent given Θ, which has two states
ID_dep: 3-state      450 test cases, where the O_i nodes are conditionally dependent given Θ, which has three states
ID_dep: 4-state      450 test cases, where the O_i nodes are conditionally dependent given Θ, which has four states


4.  Experiments 


The experiments are designed to demonstrate the performance of the proposed algorithm compared to the exact VOI
computation. We limit the ID test models to at most 5 layers (see footnote 2) and up to 11 information sources due to the exponential
computational time of the exact computation. Ten different ID models are constructed, where in one of the IDs the $O_i$ nodes are
conditionally independent given the $\Theta$ node. Table 3 describes the structures of these IDs. The IDs are parameterized with 150
sets of different conditional probability tables and utility functions, a process which yields 1500 test cases. The $\Theta$ node has 2, 3,
and 4 states in one-third of the test cases each. Without loss of generality, all the other random nodes and the decision node
have four states.

For each test case, the VOIs for different $O$ subsets with sizes from 3 to 11 are computed. The results from the approximation
algorithm are compared to the exact computation implemented with the brute-force method. Let $VOI_t$ be the
ground truth, and $VOI$ be the value computed with the proposed algorithm. Assuming $VOI_t \ne 0$, the error rate is defined
as follows:

$$Err = \frac{|VOI_t - VOI|}{VOI_t}.$$

The 1500 test cases described previously are divided into six groups, named ID_indep: 2-state, ID_indep: 3-state, ID_indep: 4-state,
ID_dep: 2-state, ID_dep: 3-state, and ID_dep: 4-state. Table 4 describes the six groups.

Fig. 9 illustrates the results from the six groups of 1500 test cases. Chart (a) shows the average errors for each group, while
Chart (b) shows the VOIs for one specific case, which is randomly chosen from the test cases in ID_dep: 3-state. As the set
size of the $O_i$ nodes increases, the error rate decreases. When the number of states of $\Theta$ is the same, the error rates of the dependent
cases are larger than the error rates of the conditionally independent cases. This is because the
IDs in the dependent cases have fewer independent $O$ subsets than the ID in the independent groups. Since the central-limit
theorem is the basis of our algorithm, it works better when the number of $w_i$ increases, which corresponds to the number of
independent $O$ subsets. Even when the size of the $O$ set is as small as 6, the average error is less than or around 0.1 for all the
cases. We could run several larger IDs with many more $O_i$ nodes, and the error curve would be progressively decreasing.
Here, we intend to show the trend and the capability of this algorithm.

Charts (c) and (d) show the average computational time of the exact computation and of the approximation algorithm.
When the set size of the $O_i$ nodes is small, the computational times are similar. However, as the size becomes larger, the
computational time of the exact computation increases exponentially, while that of the approximation algorithm
increases much more slowly. Thus, the larger the $O$ set is, the more time the approximation algorithm saves.
Likewise, as the number of states of each $O_i$ node increases further, the computational saving would be more significant.
As the number of states of $\Theta$ increases, the computational time also increases slightly.


5.  An  illustrative  application 

We use a real-world application in human-computer interaction to demonstrate the advantages of the proposed algorithm.
Fig. 10 shows an ID for user stress recognition and user assistance. The diagram consists of two portions. The upper
portion, from the top to the “stress” node, depicts the elements that can alter human stress. These elements include the
workload, the environmental context, specific characteristics of the user such as his/her traits, and the importance of the
goal that he/she is pursuing. This portion is called the predictive portion. On the other hand, the lower portion of the
diagram, from the “stress” node to the leaf nodes, depicts the observable features that reveal stress. These features
include the quantifiable measures of the user's physical appearance, physiology, behaviors, and performance. This portion
is called the diagnostic portion.

2 The length of the longest path starting from (or ending at) the hypothesis node is 5.

Fig. 9. Results from the six groups of 1500 test cases: (a) average error rates with the approximation algorithm; (b) VOI_t vs. VOI
for one test case from ID_dep: 3-state; (c) computational time (log(t), unit is second) for the groups ID_indep: n-state,
n = 2, 3, 4; and (d) computational time (log(t), unit is second) for the groups ID_dep: n-state, n = 2, 3, 4.

Fig. 10. An influence diagram for recognizing human stress and providing user assistance. Ellipses denote chance nodes,
rectangles denote decision nodes, and diamonds denote utility nodes. All the chance nodes have three states.

Fig. 11. Results for the stress modeling: (a) average errors with the approximation algorithm; (b) Euclidean distance between
the true and approximated $\sum_{o \in o_{d_k}} p(o \mid \theta_i)$; (c) computational time (log(t), unit is second);
(d) true VOI vs. approximated VOI.


The  hybrid  structure  enables  the  ID  to  combine  the  predictive  factors  and  observable  evidence  in  user  stress  inference.  For 
more  detail  please  refer  to  [19]. 

To provide timely and appropriate assistance to relieve stress, two types of decision nodes are embedded in the model.
The first type is the assistance node associated with the stress node, which includes three types of assistance
that have different degrees of impact and intrusiveness to a user. The other type is the sensing action
node ($S_i$ node in Fig. 10), which decides whether or not to activate a sensor for collecting evidence. Through the ID, we
decide the sensing actions and the assistance action sequentially. To first determine the sensing actions (which sensors
should be turned on), VOI is computed for a set $S$ consisting of $S_i$. Using the notations defined before, we have
$\mathrm{VOI}(S) = \mathrm{VOI}(E) - \sum_{S_i \in S} U_j(S_i)$, where $E$ is the set of observations corresponding to $S$
and $\mathrm{VOI}(E) = EU(E) - EU(\bar{E})$.

Fig. 11 shows the experimental results for the stress model. We enumerate all the possible combinations of sensors and
then compute the value of information for each combination. Chart (a) illustrates the average VOI errors for different sensor
sets of the same size, and Chart (b) displays the Euclidean distance between the true and estimated probabilities
$\sum_{o \in o_{d_k}} p(o \mid \theta_i)$ (Eq. (26)). As in the simulation experiments, the error decreases as the size of
the $O$ set increases, and the computational time of the approximation algorithm increases almost linearly.

6.  Conclusions  and  future  work 

As  a  concept  commonly  used  in  influence  diagrams,  VOI  is  widely  used  as  a  criterion  to  rate  the  usefulness  of  various 
information  sources,  and  to  decide  whether  pieces  of  evidence  are  worth  acquiring  before  actually  using  the  information 
sources.  Due  to  the  exponential  time  complexity  of  computing  non-myopic  VOI  for  multiple  information  sources,  most 
researchers  focus  on  the  myopic  VOI,  which  requires  the  assumptions  (“No  competition”  and  “One-step  horizon”)  that 
may  not  meet  the  requirements  of  real-world  applications. 

We thus proposed an algorithm to approximately compute non-myopic VOI efficiently by utilizing the central-limit theorem.
Although it is motivated by the method of [11], it overcomes the limitations of that method and works for more general
cases: specifically, it requires neither a binary-state assumption for all the nodes nor a conditional-independence
assumption for the information sources. Table 5 compares our method with the method in [11]. Because of these relaxed
assumptions, our method can be applied to a much broader range of problems. The experiments demonstrate that the proposed
algorithm can approximate the true non-myopic VOI well, even with a small number of observations. The efficiency of the
algorithm makes it a feasible solution in applications where many information sources must be evaluated efficiently.

Table 5
The proposed algorithm vs. the algorithm in [11]

Our algorithm                                                   | Heckerman’s algorithm
Hypothesis node ($\Theta$) can have multiple states             | $\Theta$ has to be binary
Decision node ($D$) can have multiple rules                     | $D$ has to be binary
Information source nodes ($O_i$) can be dependent on each other | $O_i$ have to be conditionally independent of each other

Nevertheless, the proposed algorithm focuses on influence diagrams with one decision node under certain assumptions.
For example, we currently assume that the hypothesis node $\Theta$ and the decision node $D$ are independent. If $D$ and
$\Theta$ are dependent, but conditionally independent given the observation set $O$, Eqs. (5) and (6) will not be affected,
so our algorithm can still apply. However, if $D$ and $\Theta$ are dependent given $O$, it may be difficult to directly
apply our algorithm. Another scenario is when there is more than one hypothesis node and/or utility node. One possible
solution is to group all these hypothesis nodes into one. We would like to study these issues in the future.


Appendix 


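Proposition 1. Let $r_{jk} = \frac{u_{2j} - u_{2k}}{u_{1k} - u_{1j} + u_{2j} - u_{2k}}$,
$p^*_{kl} = \max_{k < j \le q} r_{jk}$, and $p^*_{ku} = \min_{1 \le j < k} r_{jk}$; then $d_k$ is the optimal action if and only if

$$p^*_{kl} \le p(\theta_1) \le p^*_{ku}.$$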


Proof of Proposition 1. ($\Rightarrow$) In this direction, we prove that if $d_k$ is the optimal action, then
$p(\theta_1) \ge \max_{k < j \le q} r_{jk}$ and $p(\theta_1) \le \min_{1 \le j < k} r_{jk}$.

If $d_k$ is the optimal action, $EU(d_k)$ must be larger than or equal to the expected utility of any other action. Based on the
definition, the expected utility of taking the action $d_k$ is $EU(d_k) = p(\theta_1) u_{1k} + p(\theta_2) u_{2k}$, where
$u_{1k} = u(\theta_1, d_k)$ and $u_{2k} = u(\theta_2, d_k)$. Therefore, we get the following relations:


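$$EU(d_k) \ge EU(d_j) \quad \forall j,\ j \ne k, \tag{27}$$

$$\Rightarrow\ p(\theta_1) u_{1k} + p(\theta_2) u_{2k} \ge p(\theta_1) u_{1j} + p(\theta_2) u_{2j}, \tag{28}$$

$$\Rightarrow\ p(\theta_1) \ge \frac{u_{2j} - u_{2k}}{u_{1k} - u_{1j} + u_{2j} - u_{2k}} = r_{jk} \quad \text{if } j > k, \tag{29}$$

$$p(\theta_1) \le \frac{u_{2j} - u_{2k}}{u_{1k} - u_{1j} + u_{2j} - u_{2k}} = r_{jk} \quad \text{if } j < k. \tag{30}$$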


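Thus, based on the above equations, $p(\theta_1) \ge \max_{k < j \le q} r_{jk}$ and $p(\theta_1) \le \min_{1 \le j < k} r_{jk}$.

($\Leftarrow$) In this direction, we prove that if $p(\theta_1) \ge \max_{k < j \le q} r_{jk}$ and
$p(\theta_1) \le \min_{1 \le j < k} r_{jk}$, then $d_k$ is the optimal action.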


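If $p(\theta_1) \ge \max_{k < j \le q} r_{jk}$, then for all $j$ with $k < j \le q$ we get

$$p(\theta_1) \ge r_{jk} = \frac{u_{2j} - u_{2k}}{u_{1k} - u_{1j} + u_{2j} - u_{2k}}, \tag{31}$$

$$p(\theta_1)\,(u_{1k} - u_{1j} + u_{2j} - u_{2k}) \ge u_{2j} - u_{2k}, \tag{32}$$

$$p(\theta_1) u_{1k} + (1 - p(\theta_1)) u_{2k} \ge p(\theta_1) u_{1j} + (1 - p(\theta_1)) u_{2j}, \tag{33}$$

$$EU(d_k) \ge EU(d_j). \tag{34}$$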


Similarly, for all $j$ with $1 \le j < k$, we can get $EU(d_k) \ge EU(d_j)$. Therefore, $d_k$ has the maximal expected
utility and thus is the optimal decision.
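
As a numerical illustration of Proposition 1, the following minimal sketch computes the interval of $p(\theta_1)$ on which an action is optimal and compares it with a direct expected-utility maximization. The utility values are hypothetical, indices are 0-based, and two points are assumptions that the proposition as printed does not spell out: the actions are ordered so that $u_{1k} - u_{2k}$ strictly decreases with $k$ (which keeps the denominator of $r_{jk}$ positive for $j > k$), and an empty max or min is replaced by 0 or 1, respectively.

```python
def optimal_interval(k, u1, u2):
    """Interval [lower, upper] of p(theta_1) on which action k is optimal.

    u1[j] = u(theta_1, d_j), u2[j] = u(theta_2, d_j).
    Assumes u1[j] - u2[j] is strictly decreasing in j (illustrative assumption).
    """
    def r(j):
        return (u2[j] - u2[k]) / (u1[k] - u1[j] + u2[j] - u2[k])

    q = len(u1)
    # Empty ranges fall back to the trivial bounds 0 and 1 (assumption).
    lower = max((r(j) for j in range(k + 1, q)), default=0.0)
    upper = min((r(j) for j in range(k)), default=1.0)
    return lower, upper


def best_action(p, u1, u2):
    """Index of the action with maximal expected utility at p = p(theta_1)."""
    eu = [p * a + (1.0 - p) * b for a, b in zip(u1, u2)]
    return max(range(len(eu)), key=eu.__getitem__)


# Hypothetical utilities for three actions and two hypothesis states.
u1 = [10.0, 6.0, 0.0]
u2 = [0.0, 5.0, 8.0]
for p in (0.05, 0.2, 0.4, 0.5, 0.7, 0.95):
    k = best_action(p, u1, u2)
    lower, upper = optimal_interval(k, u1, u2)
    print(p, k, round(lower, 3), round(upper, 3))
```

For every probed value of $p$, the action found by direct maximization has an interval that contains $p$, which is the "only if" direction of the proposition on this example.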


Proposition 2. $\sum_{o \in o_{d_k}} p(o) = p\big(p^*_{kl} \le p(\theta_1 \mid o) \le p^*_{ku}\big)$.

Proof of Proposition 2. Based on Proposition 1, $d_k$ is the optimal decision if and only if the value of $p(\theta_1)$ is
between $p^*_{kl}$ and $p^*_{ku}$. Therefore, given an instantiation $o$, the probability that $d_k$ is the optimal decision
is equal to the probability that $p(\theta_1 \mid o)$ is between $p^*_{kl}$ and $p^*_{ku}$, i.e.,
$p\big(p^*_{kl} \le p(\theta_1 \mid o) \le p^*_{ku}\big)$.

On the other hand, we know that $o_{d_k}$ is a subset of instantiations, each of which corresponds to the optimal action $d_k$.
Therefore, as long as $o$ belongs to the set $o_{d_k}$, $d_k$ will be the optimal decision. In other words, the probability of
$d_k$ being the optimal decision is the sum of the probabilities of all $o \in o_{d_k}$, which is
$\sum_{o \in o_{d_k}} p(o)$. Therefore,
$\sum_{o \in o_{d_k}} p(o) = p\big(p^*_{kl} \le p(\theta_1 \mid o) \le p^*_{ku}\big)$.
Abstract — In our previous paper, we formalized an active information
fusion framework based on dynamic Bayesian networks.
This paper focuses on a central
issue of active information fusion, i.e., the efficient identification of
a subset of sensors that are most decision relevant and cost effective.
Determining the most informative and cost-effective sensors
requires an evaluation of all the possible subsets of sensors, which
is computationally intractable, particularly when an
information-theoretic criterion such as mutual information is used.
To overcome this challenge, we propose a new quantitative measure
for  sensor  synergy  based  on  which  a  sensor  synergy  graph  is 
constructed.  Using  the  sensor  synergy  graph,  we  first  introduce 
an  alternative  measure  to  multisensor  mutual  information  for 
characterizing  the  sensor  information  gain.  We  then  propose  an 
approximated  nonmyopic  sensor  selection  method  that  can  effi¬ 
ciently  and  near-optimally  select  a  subset  of  sensors  for  active 
fusion.  The  simulation  study  demonstrates  both  the  performance 
and  the  efficiency  of  the  proposed  sensor  selection  method. 

Index  Terms — Active  information  fusion,  Bayesian  networks 
(BNs),  sensor  selection,  situation  awareness. 

I.  Introduction 

INFORMATION  fusion  is  playing  an  increasingly  important 
role  in  improving  the  performance  of  sensory  systems  for 
various  applications,  including  situation  assessment,  enemy 
intent  understanding  and  prediction,  and  threat  assessment. 
As  sensors  become  ubiquitous,  persistent,  and  pervasive,  and 
coupled  with  the  ever  increasing  demand  for  less  time  and  fewer 
resources,  it  becomes  critically  important  to  perform  selective 
fusion  so  that  decision  can  be  made  in  a  timely  and  efficient 
manner.  The  need  for  sensor  selection  is  further  demonstrated 
by  the  availability  of  an  increasingly  large  volume  of  sensory 
data  and  by  the  variability  of  sensor  reliability  over  time  and 
over  location.  It  is  important  to  select  the  sensors  not  only  to 
reduce  the  amount  of  data  to  integrate  but  also  to  improve  fusion 
accuracy  by  selecting  the  most  reliable  sensors  for  a  certain 
location  at  a  certain  time,  by  selecting  complementary  sensors, 
and  by  reducing  sensor  redundancy.  Active  fusion  serves  these 
purposes well. Active fusion extends the paradigm of information
fusion by being concerned not only with the methodology
of how to combine information but also with the
fusion efficiency, timeliness, and accuracy. Active fusion can
be defined as the process of combining data with a control
mechanism that dynamically selects a subset of sensors to
minimize uncertainty in situation assessment and to maximize
the overall expected utility in decision making.

Fig. 1. BN used for active information fusion, where $\Theta$ and $S_i$ are the hypothesis
and the sensors, respectively. $X_i$ and $Y_i$ are the intermediate variables, and
they are needed to model the relationships among sensors and the hypothesis.
Sensor fusion is accomplished through probabilistic inference given the sensory
measurements.

In our previous work [1], we formalized an information fusion
framework based on Bayesian networks (BNs) to provide
active  and  sufficing  information  fusion.  BNs  are  used  to  model 
a  number  of  uncertain  events,  their  spatial  relationships,  and 
the  sensor  measurements.  Given  the  sensory  measurements, 
information  fusion  is  performed  through  probabilistic  inference 
using  the  BN.  This  can  be  accomplished  through  bottom-up 
belief  propagation,  as  illustrated  in  Fig.  1.  Our  previous  work, 
however,  did  not  address  the  core  issue  in  active  fusion,  i.e., 
efficient  sensor  selection.  This  is  the  focus  of  this  paper. 

Based on information theory [2], the more sensors (see footnote 1) we
use, the more information we can obtain. However, every
act of information gathering incurs cost. Sensor costs may
include physical costs, computational costs, maintenance costs,
and human costs (e.g., risk). Many applications are
constrained by limited time and resources. An essential issue
for active information fusion is to select a subset of the most
synergetic sensors, which can maximally reduce the uncertainty
about the events of interest with minimum costs. Dynamically
determining the best set of sensors, given the uncertainty about
the state of the world, requires enumerating all the possible
subsets of sensors, which is computationally intractable and
practically infeasible. This computational difficulty is twofold.
First, the computation of a sensor selection criterion such as
mutual information is exponential with respect to the number
of sensors. Second, searching for an optimal subset of sensors
is also NP-hard, since the sensor space exponentially increases
with the number of sensors. To address this computational
difficulty, a common practice is to use myopic analysis,
which assumes that only one observation will be available at
a time, even when there is an opportunity to make a set of
observations [3]-[6]. There is a vast literature on the problem
of single optimal sensor selection [7]-[9]. However, the myopic
approach cannot guarantee obtaining the evidence that
most effectively reduces uncertainty and cost. To effectively
reduce uncertainty and cost, one should use nonmyopic
selection, which simultaneously considers several observations
before making a decision. The most common nonmyopic
method is the greedy approach. While efficient, it cannot
guarantee the optimality of the selected sensors. Other works
try to overcome the limitations of the greedy approach, yet
with their own strong assumptions. In [10], Heckerman et al.
presented an approximate nonmyopic approach based on the
central-limit theorem in an influence diagram (ID) for efficiently
computing the value of information. Their method, however,
assumes that the sensors are conditionally independent
of each other, given the decision variable, and that the decision
variable is binary. Krause and Guestrin [11], [12] presented
a randomized approximation algorithm for selecting a
near-optimal subset of observations for graphical models. Under
the assumption that the sensors are conditionally independent
given the decision variables, the information gain is
guaranteed to be a submodular function, and the theory of
submodular functions can then be applied to achieve a
near-optimal solution in selecting a subset of observations using a
greedy approach. Recently, Liao and Ji [13] have presented an
approximation algorithm for the nonmyopic computation of
the value of information in an ID. Their method extends the
approach in [10] without requiring the sensors to be conditionally
independent of each other or the decision node to be binary.

1 For generality, sensors could refer to any devices/means of acquiring
information. For example, they may be electromagnetic or acoustic devices, or
they could be direct observations of the world through reconnaissance and
intelligence gathering activities.

1083-4419/$26.00 © 2009 IEEE

This paper takes another avenue of approach to efficiently
select a subset of near-optimal sensors without the strong sensor
independence assumptions made in [10] and [12]. Specifically,
we first introduce a new quantitative measure of sensor
synergy based on mutual information. Based on the synergy
measure, we then introduce a method to efficiently compute
the least upper bound (LUB) of mutual information for a set of
sensors. Experiments show that the LUB closely approximates
the mutual information in value, as shown in Figs. 5 and 6.
Hence, the computational difficulty of computing the exact
nonmyopic mutual information can be circumvented
by computing its LUB instead. In addition, the synergy measure
can also be used to prune the sensor space, which
reduces the search time for the best sensor set. A summary of
the mentioned work may be found in [14].


II.  Problem  Formulation 


The problem of sensor selection for active fusion can be
stated as follows: Assume that there are $m$ sensors $S_i$, $i = 1, \ldots, m$,
available that provide measurements of the world. Let
$\Theta$ be a set of hypotheses $\theta_k$, $k = 1, \ldots, K$, of the world situation.
Let $S = \{S_1, \ldots, S_n\}$ be a subset of $n$ sensors selected at time
$t$, where $n \in \{1, \ldots, m\}$. Let $C(S)$ be the cost to use the set of
sensors $S$. The objective of sensor selection at time $t + 1$ is to
select a subset of sensors $S^*$ to achieve the maximal utility, i.e.,

$$S^* = \arg\max_{S \in \mathcal{S}} U(u_1, u_2) \tag{1}$$

where $u_1$ and $u_2$ denote the information gain (i.e., the mutual
information) and the sensor usage cost saving, respectively, $\mathcal{S}$
represents all the possible subsets of sensors, and $U(u_1, u_2)$
is a utility function. Here, we use $u_2 = 1 - C(S)$ to convert
the sensor usage cost to the corresponding cost saving, which
makes $u_1$ and $u_2$ qualitatively equivalent. For simplicity, in
this paper, we assume that the cost is the same for all sensors.
Hence, we can ignore $u_2$.

The major difficulty of using (1) for sensor selection is to
efficiently compute the information gain $u_1$. From information
theory, the entropy of hypothesis $\Theta$ given a sensor $S_i$ measures
how much uncertainty exists in $\Theta$ given $S_i$, i.e.,

$$H(\Theta \mid S_i) = -\sum_{s_i} \sum_{\theta} P(\theta, s_i) \log P(\theta \mid s_i) \tag{2}$$

where $s_i$ denotes a reading of sensor $S_i$ (see footnote 2). Subtracting $H(\Theta \mid S_i)$
from the original uncertainty in $\Theta$ without $S_i$, i.e., $H(\Theta)$, yields
the expected amount of information about $\Theta$ that $S_i$ is capable
of providing


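$$\begin{aligned}
I(\Theta; S_i) &= H(\Theta) - H(\Theta \mid S_i) \\
&= -\sum_{\theta} P(\theta) \log P(\theta)
   + \sum_{s_i} \Big\{ P(s_i) \sum_{\theta} P(\theta \mid s_i) \log P(\theta \mid s_i) \Big\} \\
&= \sum_{\theta} \sum_{s_i} P(\theta, s_i) \log \frac{P(\theta \mid s_i)}{P(\theta)}
\end{aligned} \tag{3}$$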


where $I(\Theta; S_i)$ is referred to as the mutual information, which
characterizes the expected total potential of $S_i$ to reduce the
uncertainty in $\Theta$. The mutual information for a sensor set
$S = \{S_1, \ldots, S_n\}$ can be obtained by


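$$\begin{aligned}
I(\Theta; S) &= H(\Theta) - H(\Theta \mid S) \\
&= -\sum_{\theta} P(\theta) \log P(\theta)
   + \sum_{\theta} \sum_{s_1} \cdots \sum_{s_n} \big\{ P(\theta, s_1, \ldots, s_n) \log P(\theta \mid s_1, \ldots, s_n) \big\} \\
&= \sum_{\theta} \sum_{s_1} \cdots \sum_{s_n} \Big\{ P(\theta, s_1, \ldots, s_n) \log \frac{P(\theta \mid s_1, \ldots, s_n)}{P(\theta)} \Big\}
\end{aligned} \tag{4}$$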


2Without  loss  of  generality,  here  we  assume  discrete  sensor  measure¬ 
ment.  The  theories  can  be  straightforwardly  extended  to  continuous  sensor 
measurements. 
where $P(\theta, s_1, \ldots, s_n)$ and $P(\theta \mid s_1, \ldots, s_n)$ at time $t$ can
directly be obtained through BN inference. The mutual information
in (4) provides a sensor selection criterion in terms of
the uncertainty reduction potential.

It  is  clear  from  (4)  that  when  the  number  of  sensors  in  S  is 
large  or  when  the  number  of  states  for  each  sensor  is  large,  it 
becomes  computationally  impractical  to  simply  implement  this 
information-theoretic  criterion,  because  it  generally  requires 
time  exponential  in  the  number  of  summations  to  exactly 
compute  the  mutual  information.  The  remainder  of  this  paper 
addresses  this  computational  difficulty. 
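
To make the cost of (4) concrete, the following minimal sketch evaluates the mutual information by brute-force summation over a full joint table. The table layout (a dictionary keyed by `(theta, s_1, ..., s_n)`) and the toy model in `noisy_copy_joint` (a uniform binary hypothesis observed by conditionally independent sensors that flip its value with probability `flip`) are illustrative assumptions, not the BN used in this paper; the result is in nats.

```python
import itertools
import math


def mutual_information(joint):
    """I(Theta; S) in nats from a table {(theta, s_1, ..., s_n): probability}."""
    p_theta, p_s = {}, {}
    for (theta, *s), p in joint.items():
        p_theta[theta] = p_theta.get(theta, 0.0) + p
        p_s[tuple(s)] = p_s.get(tuple(s), 0.0) + p
    mi = 0.0
    for (theta, *s), p in joint.items():
        if p > 0.0:
            # P(theta | s) / P(theta) = P(theta, s) / (P(theta) * P(s))
            mi += p * math.log(p / (p_theta[theta] * p_s[tuple(s)]))
    return mi


def noisy_copy_joint(n_sensors, flip=0.1):
    """Toy joint: each binary sensor reports theta, flipped with probability `flip`."""
    joint = {}
    for theta in (0, 1):
        for readings in itertools.product((0, 1), repeat=n_sensors):
            p = 0.5
            for s in readings:
                p *= (1.0 - flip) if s == theta else flip
            joint[(theta, *readings)] = p
    return joint


for n in (1, 2, 4, 8):
    table = noisy_copy_joint(n)
    print(n, len(table), round(mutual_information(table), 4))
```

The table has $2 \cdot 2^n$ entries for $n$ binary sensors, so the number of terms in the sum doubles with every added sensor, which is the exponential growth described above.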

III.  Approximation  Algorithm 

In  this  section,  we  give  a  graph-theoretic  definition  of  sensor 
synergy.  We  then  present  the  theorems  on  which  our  algorithm 
is  based. 


A.  Sensor  Synergy  in  Information  Gain 

Throughout this section, it is assumed that we have obtained
$I(\Theta; S_i, S_j)$ and $I(\Theta; S_i)$, i.e., the mutual information of all
pairs of sensors and of all individual sensors with respect to $\Theta$,
respectively. We will introduce an efficient method to obtain
all $I(\Theta; S_i, S_j)$ in Section III-C. We first define a synergy
coefficient to characterize the synergy between two sensors, and
then extend this definition to multiple sensors.

Definition 1 (Synergy Coefficient): A measure of the expected
synergetic potential between two sensors $S_i$ and $S_j$ in
reducing the uncertainty of hypothesis $\Theta$ is defined as

$$r_{ij} = \frac{I(\Theta; S_i, S_j) - \max\big(I(\Theta; S_i),\, I(\Theta; S_j)\big)}{H(\Theta)} \tag{5}$$


The denominator $H(\Theta)$ in (5) restricts $r_{ij}$ to the interval
[0, 1]. It can easily be proved that $r_{ij} \ge 0$ based on the “information
never hurts” principle [2], i.e., $I(\Theta; S_i, S_j) \ge I(\Theta; S_i)$
and $I(\Theta; S_i, S_j) \ge I(\Theta; S_j)$. It follows that $S_i$ and $S_j$ taken
together are always at least as informative as either taken
alone. The larger $r_{ij}$ is, the more synergetic the sensors $S_i$
and $S_j$ are. Obviously, $r(\cdot, \cdot)$ is symmetrical in $S_i$ and $S_j$, and
$r_{ij} = 0$ if $i = j$.
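
The synergy coefficient in (5) only needs the singleton and pairwise mutual information values and the entropy of the hypothesis. The sketch below shows one way to tabulate all coefficients; the numeric values are hypothetical and chosen only so that each pairwise value is at least as large as the corresponding singleton values.

```python
def synergy_coefficient(mi_pair, mi_i, mi_j, h_theta):
    """r_ij as in (5): pairwise gain over the better single sensor, scaled by H(Theta)."""
    return (mi_pair - max(mi_i, mi_j)) / h_theta


def synergy_table(mi_single, mi_pair, h_theta):
    """Symmetric n x n table of r_ij with zeros on the diagonal.

    mi_single[i] = I(Theta; S_i); mi_pair[(i, j)] = I(Theta; S_i, S_j) for i < j.
    """
    n = len(mi_single)
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r[i][j] = r[j][i] = synergy_coefficient(
                mi_pair[(i, j)], mi_single[i], mi_single[j], h_theta
            )
    return r


# Hypothetical values for three sensors (not taken from the paper).
h_theta = 1.0
mi_single = [0.30, 0.20, 0.10]
mi_pair = {(0, 1): 0.45, (0, 2): 0.32, (1, 2): 0.28}
for row in synergy_table(mi_single, mi_pair, h_theta):
    print([round(x, 3) for x in row])
```

In this toy example, sensors 0 and 1 have the largest coefficient, so they would be the most synergetic pair under Definition 1.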

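Definition 2 (Synergy Matrix): Let a sensor set be $S = \{S_1, \ldots, S_n\}$.
The sensor synergy matrix is an $n \times n$ matrix
defined as

$$R = \begin{bmatrix}
0 & r_{12} & \cdots & r_{1n} \\
r_{21} & 0 & \cdots & r_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
r_{n1} & r_{n2} & \cdots & 0
\end{bmatrix} \tag{6}$$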


R  is  an  information  measure  of  synergy  among  sensors  that 
is  based  on  pairwise  sensor  synergy.  With  a  synergy  matrix,  we 
naturally  define  its  graphical  representation. 

Definition 3 (Synergy Graph): Given a sensor synergy matrix,
a graph $G = (S, E)$, where the nodes $S$ represent
the set of available sensors and the links $E$ represent
the set of pairwise synergetic links weighted by the synergy
coefficients $r_{ij}$, is a sensor synergy graph.


Fig.  2.  Example  of  synergy  graph  with  five  sensors. 


Fig. 3. (a) Synergy chain $\{S_1, S_2, S_3, S_4\}$ (highlighted) on a pruned synergy
graph. (b) Corresponding MSC.

We  use  the  synergy  graph  to  graphically  represent  the  syn¬ 
ergy  among  multiple  sensors.  By  definition  of  the  synergy,  G  is 
a  complete  graph,  i.e.,  there  is  a  link  between  any  two  nodes  in 
the  graph.  Fig.  2  gives  an  example  synergy  graph  consisting  of 
five  sensors. 

Definition 4 (Pruned Synergy Graph): A pruned synergy
graph is created from a synergy graph by removing some
links. A pruned synergy graph is, therefore, not a complete graph.

Fig. 3 shows an example of a pruned synergy graph. To
further exploit the theoretical properties of the mutual information
$I(\Theta; S)$ for a set of sensors, we give the following definitions.

Definition  5  (Synergy  Chain):  Given  a  pruned  synergy  graph 
G ,  if  all  the  sensors  in  a  subset  on  G  are  serially  linked,  then 
this  subset  of  sensors  is  referred  to  as  a  sensor  synergy  chain. 
Note  that  while  the  sensors  in  a  set  S  are  generally  order 
independent,  the  sensors  in  a  synergy  chain  are  order  dependent 
and  sequentially  ordered. 

Definition 6 (MSC): Given a synergy chain with $n$ sensors,
if $p(S_{i+1} \mid S_1, S_2, \ldots, S_i) = p(S_{i+1} \mid S_i)$ for all
$i = 1, \ldots, n - 1$, then the chain that describes the synergetic
relationship among $\{S_1, \ldots, S_n\}$ is a Markov synergy chain
(MSC). An MSC is also ordered.

Fig. 3 graphically shows the above definitions about the
synergy chain in a pruned synergy graph. The MSC represents
an ideal synergetic relationship among sensors. The MSC rarely
exists in practice, but this does not prevent us from using it as a
basis for the graph-theoretic analysis of synergy among sensors.
In fact, as will be shown later, the concept of MSC is used to
define the upper bound for the mutual information of a set of
sensors. With the above definitions, we give the following two
theorems.
Fig. 4. Illustration of a set of possible MSCs for an unordered set of four sensors
in a pruned synergy graph. Figure content: sensor set 1, 2, 3, 4; corresponding
Markov synergy chains 1234; 1243; 1324; 1342; 2134; 2431; 3421; 3124; 4312;
4231; 4213.

Theorem 1 (MSC Rule): Given an MSC with a set of ordered
sensors $S = \{S_1, \ldots, S_n\}$, for any $n$, the joint mutual information
with respect to $\Theta$ for sensors on an MSC is

$$I^M(\Theta; S_1, \ldots, S_n) = I(\Theta; S_1) + \sum_{i=1}^{n-1} \big( I(\Theta; S_i, S_{i+1}) - I(\Theta; S_i) \big). \tag{7}$$

The proof of this theorem can be found in Appendix A. Note
that the mutual information for an MSC is
sensor-order dependent due to the pairwise synergy definition.
The significance of Theorem 1 is that it allows us to efficiently
compute the joint mutual information for $n$ ($n > 2$) ordered
sensors as a sum of mutual information of only singleton and
pairwise sensors if the set of sensors forms an MSC. In contrast
to (4), the computational cost of (7) is dramatically reduced.
Although (7) holds specifically for an MSC, the theorem has
some useful properties that can be used for the solution of our
sensor selection problem.
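
The MSC rule in (7) can be evaluated with a single pass along the chain. The following minimal sketch assumes the singleton and pairwise mutual information values are already available in lookup tables; the numbers are hypothetical and sensors are identified by 0-based indices.

```python
def msc_mutual_information(order, mi_single, mi_pair):
    """Right-hand side of (7) for the sensors listed in `order`.

    mi_single[i] = I(Theta; S_i); mi_pair[frozenset((i, j))] = I(Theta; S_i, S_j).
    """
    total = mi_single[order[0]]
    for a, b in zip(order, order[1:]):
        # Each link adds what the pair provides beyond the earlier sensor alone.
        total += mi_pair[frozenset((a, b))] - mi_single[a]
    return total


# Hypothetical values for three sensors (not taken from the paper).
mi_single = {0: 0.30, 1: 0.20, 2: 0.10}
mi_pair = {
    frozenset((0, 1)): 0.45,
    frozenset((0, 2)): 0.32,
    frozenset((1, 2)): 0.28,
}
print(round(msc_mutual_information((0, 1, 2), mi_single, mi_pair), 4))
print(round(msc_mutual_information((1, 0, 2), mi_single, mi_pair), 4))
```

The loop performs $n - 1$ table lookups for a chain of $n$ sensors, and the two orderings above give different values, which reflects the order dependence noted after Theorem 1.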

Theorem 2 (Synergy Upper Bound): For a set of unordered
sensors $S = \{S_1, \ldots, S_n\}$, its mutual information is upper
bounded by the mutual information of the corresponding
MSC, i.e.,

$$I(\Theta; S_1, \ldots, S_n) \le I^M(\Theta; S_1, \ldots, S_n). \tag{8}$$

The proof of this theorem is provided in Appendix B. Please
note that while $I(\Theta; S_1, \ldots, S_n)$ is sensor-order independent,
$I^M(\Theta; S_1, \ldots, S_n)$ is sensor-order dependent. As a result,
depending on the order of sensors in $S$, different MSCs may be
produced. Let

$$I^M_{\min} = \min_{S} I^M(\Theta; S), \qquad I^M_{\max} = \max_{S} I^M(\Theta; S) \tag{9}$$

where $S$ ranges over all the possible orders of a sensor
set $\{S_1, \ldots, S_n\}$. $I^M_{\min}$ is referred to as the LUB of
$I(\Theta; S_1, \ldots, S_n)$, and $I^M_{\max}$ is referred to as the greatest upper
bound (GUB) of $I(\Theta; S_1, \ldots, S_n)$. For example, in Fig. 4, the
sensor set $S = \{S_1, S_2, S_3, S_4\}$ has multiple MSCs, as given in
this figure, and there exist a LUB and a GUB of $I(\Theta; S)$.
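
The LUB and GUB in (9) can be illustrated by enumerating every ordering of a small sensor set and evaluating (7) for each. This exhaustive enumeration is an illustrative assumption rather than the search procedure of this paper; it visits $n!$ orderings, so it is only practical for small $n$. The sensor names and numeric values are hypothetical.

```python
from itertools import permutations


def chain_value(order, mi_single, mi_pair):
    """Right-hand side of (7) for one sensor ordering."""
    total = mi_single[order[0]]
    for a, b in zip(order, order[1:]):
        total += mi_pair[frozenset((a, b))] - mi_single[a]
    return total


def lub_and_gub(sensors, mi_single, mi_pair):
    """Smallest and largest MSC value over all orderings, with one ordering for each."""
    values = {
        order: chain_value(order, mi_single, mi_pair)
        for order in permutations(sensors)
    }
    lub_order = min(values, key=values.get)
    gub_order = max(values, key=values.get)
    return (lub_order, values[lub_order]), (gub_order, values[gub_order])


# Hypothetical singleton and pairwise mutual information values.
mi_single = {"S1": 0.30, "S2": 0.20, "S3": 0.10}
mi_pair = {
    frozenset(("S1", "S2")): 0.45,
    frozenset(("S1", "S3")): 0.32,
    frozenset(("S2", "S3")): 0.28,
}
lub, gub = lub_and_gub(("S1", "S2", "S3"), mi_single, mi_pair)
print("LUB:", lub[0], round(lub[1], 4))
print("GUB:", gub[0], round(gub[1], 4))
```

Reversing an ordering leaves the value of (7) unchanged, because the sum reduces to the pairwise terms along the chain minus the singleton terms of the interior sensors, so at most $n!/2$ distinct values need to be compared.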

We  are  particularly  interested  in  the  LUB  of  7(0;  S)  due 
to  two  reasons.  First,  it  can  be  seen  from  Figs.  5  and  6  that 
the  LUBs  of  7(0;  S)  closely  follow  the  trend  of  7(0;  S)  in 
the  entire  space  of  sensor  subsets.  Second,  the  exact  value  of 
7(0;  S)  and  its  LUB  are  quantitatively  very  close  in  value. 
Thus,  7*{n(0;  S)  provides  a  substitute  measure  for  7(0;  S) 


Fig. 5. Bound of mutual information $I(\Theta; S)$ and its exact value from a six-sensor BN model. The X-axis represents the indexes of 41 sensor subsets. Labels 1-20 are the indexes of the three-sensor subsets; Labels 21-35 are the indexes of the four-sensor subsets; and Labels 36-41 are the indexes of the five-sensor subsets.


Fig. 6. Bound of mutual information $I(\Theta; S)$ and its exact value from a ten-sensor BN model. The X-axis represents the indexes of sensor subsets. For clarity, the figure only shows 66 subsets out of 627. Labels 1-18 are the indexes of the five-sensor subsets; Labels 19-34 are the indexes of the six-sensor subsets; Labels 35-51 are the indexes of the seven-sensor subsets; and Labels 52-66 are the indexes of the eight-sensor subsets.

that can be used to evaluate an optimal sensor subset. Importantly, the LUBs of $I(\Theta; S)$ can be written as the sum of the mutual information of only pairwise sensors and singleton sensors, as shown in (7), and hence can be obtained at very low computational cost. Therefore, the computational difficulty in exactly computing the higher-order mutual information can be circumvented by only computing the LUBs of the mutual information. This is the central strategy of our approach.
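As a worked illustration of why only pairwise and singleton terms are needed, the closed form derived in Appendix A, (23), can be written out for a three-sensor chain taken in the order $S_1, S_2, S_3$ (the choice of three sensors and of this order is only an example):

```latex
\begin{aligned}
I^{M}(\Theta; S_1, S_2, S_3)
  &= I(\Theta; S_1)
   + \bigl[ I(\Theta; S_1, S_2) - I(\Theta; S_1) \bigr]
   + \bigl[ I(\Theta; S_2, S_3) - I(\Theta; S_2) \bigr] \\
  &= I(\Theta; S_1, S_2) + I(\Theta; S_2, S_3) - I(\Theta; S_2).
\end{aligned}
```

Every term involves at most two sensors. Reversing the order gives the same expression, so for three sensors the chain bound takes at most three distinct values, one for each choice of the middle sensor, and the LUB is the smallest of them.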

B.  Pruning  Synergy  Graph 

The synergy graph is a completely connected network because the weights of the synergy graph satisfy $r_{ij} > 0$. Some sensors are highly synergetic, whereas others are not. Intuitively, sensors that
TABLE I

Example of Synergy Coefficient Without Pruning

r_ij    1       2       3       4       5       6       7       8       9       10
1       0       0.0004  0.0034  0.0034  0.0029  0.0023  0.0035  0.0035  0.0031  0.0017
2               0       0.0004  0.0004  0.0004  0.0003  0.0004  0.0004  0.0004  0.0002
3                       0       0.0215  0.0196  0.0156  0.0099  0.0122  0.0212  0.0113
4                               0       0.0184  0.0147  0.0099  0.0122  0.0198  0.0105
5                                       0       0.0930  0.0083  0.0104  0.0650  0.0659
6                                               0       0.0066  0.0082  0.0517  0.1329
7                                                       0       0.0100  0.0091  0.0050
8                                                               0       0.0112  0.0060
9                                                                       0       0.0388
10                                                                              0

cause a very small reduction in uncertainty of hypotheses are those that give us the least additional information beyond what we would obtain from other sensors. In such cases, $r_{ij}$ is very small. We prune the sensor synergy graph so that many weak sensor combinations are eliminated while preserving the most promising ones. This can significantly reduce the search space in identifying the optimal sensor subset. We prune the synergy matrix (the weights of the synergy graph) in (6) by using

$r_{ij} = \begin{cases} 1, & r_{ij} > \tau \\ 0, & \text{otherwise} \end{cases}$  (10)


where $\tau$ is a pruning threshold. The selection of an appropriate threshold $\tau$ is problem dependent. Although there is no theoretical basis for determining a good pruning threshold, our empirical tests show that using the arithmetic average of $r_{ij}$ as the pruning threshold can preserve most of the strong synergetic connections in the graph while eliminating weak links. After pruning, a fully connected synergy graph becomes a sparse graph. Table I and

$\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
  & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
  &   & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
  &   &   & 0 & 1 & 0 & 0 & 0 & 1 & 1 \\
  &   &   &   & 0 & 1 & 0 & 0 & 1 & 1 \\
  &   &   &   &   & 0 & 0 & 0 & 1 & 1 \\
  &   &   &   &   &   & 0 & 0 & 0 & 0 \\
  &   &   &   &   &   &   & 0 & 0 & 0 \\
  &   &   &   &   &   &   &   & 0 & 1 \\
  &   &   &   &   &   &   &   &   & 0
\end{bmatrix}$  (11)


are examples of the synergy coefficient before and after pruning. Fig. 7 illustrates their corresponding synergy graph from a completely connected network to a sparse graph after pruning.
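The thresholding in (10) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `prune_synergy` and the 4-sensor matrix are hypothetical, and the threshold is taken as the mean of the off-diagonal upper-triangular weights, which is one reading of "the arithmetic average of $r_{ij}$".

```python
def prune_synergy(r):
    """Binarize an upper-triangular synergy matrix as in (10).

    Assumption: the threshold tau is the mean of the off-diagonal
    upper-triangular weights.
    """
    n = len(r)
    weights = [r[i][j] for i in range(n) for j in range(i + 1, n)]
    tau = sum(weights) / len(weights)
    # Keep a link (1) only where the weight exceeds the threshold.
    pruned = [
        [1 if j > i and r[i][j] > tau else 0 for j in range(n)]
        for i in range(n)
    ]
    return pruned, tau


# Hypothetical 4-sensor synergy coefficients (upper triangle only).
r = [
    [0.0, 0.004, 0.030, 0.002],
    [0.0, 0.0, 0.020, 0.001],
    [0.0, 0.0, 0.0, 0.003],
    [0.0, 0.0, 0.0, 0.0],
]
pruned, tau = prune_synergy(r)
```

In this toy matrix only the two strongest links survive, which mirrors how the fully connected graph becomes sparse after pruning.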


C.  Computing  Pairwise  Mutual  Information 

In the above sections, we assumed that the mutual information of the pairwise sensors $I(\Theta; S_i, S_j)$ is known. For $n$ sensors, there are $n(n-1)/2$ pairs of sensors. Obtaining the mutual information for one pair of sensors requires four repetitions of inference if the sensor state is binary. Therefore, $2n(n-1)$ repetitions of inference are needed for


Fig. 7. (a) Completely connected synergy graph and the links are weighted by $r_{ij}$, as shown in Table I. (b) Pruned synergy graph and its corresponding matrix as shown in (11). The pruning threshold is the average of $r_{ij}$, and it is 0.0161.


all  pairs  of  sensors.  Although  this  computation  is  manageable, 
it  still  severely  limits  the  performance  as  n  becomes  large. 
Fortunately,  there  is  an  efficient  way  to  compute  the  mutual 
information  for  all  pairs  of  sensors  [15],  [16]. 

Referring to Fig. 1, the joint probability of hypothesis $\Theta$ and pairwise sensors $\{S_i, S_j\}$ may be written as in (12), shown at the bottom of the next page, where $\pi(x)$ represents the parental nodes of node $x$. From (12), it can be observed that the first factor $P(\Theta) \prod_{k=1}^{K} P(X_k \mid \pi(X_k)) \prod_{m=1}^{M} P(Y_m \mid \pi(Y_m))$ is related to the part of the BN structure that does not include the sensors. The structure is, therefore, fixed, and so are its probabilities. Hence, this term is constant, independent of the pair of sensors used. On the other hand, the second factor $\bigl\{ \sum_{S_1, \ldots, S_N;\, l \ne i,j} \prod_{n=1}^{N} P(S_n \mid \pi(S_n)) \bigr\}$ varies, depending on the pair of sensors selected. Therefore, we do not need to recalculate the unchanged part (the first factor) of (12) each time. Instead, we only need to compute it once and reuse it for all pairs of sensors, so that the computation of pairwise mutual information can be significantly curtailed. Given $P(\Theta, S_i, S_j)$, it can then be substituted into (4) to compute $I(\Theta; S_i, S_j)$. Details of this method can be found in [15] and [16]. Fig. 8 compares the time needed to compute (12) for all pairs of sensors by using our method and by directly using two inference algorithms, namely, clique tree propagation (CTP) [17], [18] and variable elimination (VE) [19]. The evaluation is performed on a six-layer BN model with 10, 15, and 20 sensors.
Fig. 8. Comparison of time saving among CTP, VE, and our method in computing (12) for all pairs of sensors. It can be seen that our method can significantly save time.
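Once a joint table $P(\Theta, S_i, S_j)$ is available, the pairwise mutual information follows from the standard definition of mutual information. The sketch below assumes that definition (in bits) and a hypothetical table layout `joint[theta][(s_i, s_j)]`; the helper `inference_count` simply reproduces the count of inference runs stated in the text for the direct approach.

```python
from math import log2


def mutual_information(joint):
    """I(Theta; S_i, S_j) in bits from joint[theta][(s_i, s_j)] probabilities."""
    p_theta = {t: sum(row.values()) for t, row in joint.items()}
    p_pair = {}
    for row in joint.values():
        for pair, p in row.items():
            p_pair[pair] = p_pair.get(pair, 0.0) + p
    mi = 0.0
    for t, row in joint.items():
        for pair, p in row.items():
            if p > 0.0:  # zero-probability cells contribute nothing
                mi += p * log2(p / (p_theta[t] * p_pair[pair]))
    return mi


def inference_count(n, states=2):
    """Inference runs for all pairs: n(n-1)/2 pairs times states**2 configurations."""
    return (n * (n - 1) // 2) * states ** 2


# Toy joint: S_i copies Theta exactly, S_j is an independent fair coin.
joint = {
    0: {(0, 0): 0.25, (0, 1): 0.25},
    1: {(1, 0): 0.25, (1, 1): 0.25},
}
mi = mutual_information(joint)
runs = inference_count(10)
```

In the toy table the pair reveals $\Theta$ completely, so the mutual information equals the one bit of entropy in $\Theta$; for ten binary sensors the direct approach needs $2 \cdot 10 \cdot 9 = 180$ inference runs.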

D.  Approximation  Algorithm 

We are now ready to provide the complete algorithm. Let $S$ denote the current set of selected sensors, and let $\mathrm{lub}(\Theta; S)$ be the LUB of $I(\Theta; S)$. The approximation sensor selection algorithm is given in Table II. Guided by the pruned synergy graph, the algorithm starts with the best pair of sensors identified through an exhaustive search and then searches for the next best sensor. The next best sensor is the one that, when added to the current sensor ensemble, yields the highest utility, which is computed from $\mathrm{lub}(\Theta; S)$. This process repeats, with one sensor added to the current sensor ensemble at a time, until the newly added sensor does not yield an improvement in sensor utility. Although the algorithm is greedy, the searching process is guided by a synergy graph so that the selected sensor subset is serially connected. This, therefore, ensures both the quality and the speed of sensor selection.
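The greedy procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: a "synergy chain" is taken to be an ordering of the subset in which consecutive sensors are linked in the pruned graph, the chain bound is computed with the closed form (23), and all names (`chain_bound`, `lub`, `select_sensors`) and numeric values are hypothetical.

```python
from itertools import permutations


def chain_bound(order, single_mi, pair_mi):
    """Chain bound for one order: I(T;S_1) + sum of [I(T;S_{i-1},S_i) - I(T;S_{i-1})]."""
    total = single_mi[order[0]]
    for prev, cur in zip(order, order[1:]):
        total += pair_mi[frozenset((prev, cur))] - single_mi[prev]
    return total


def lub(subset, single_mi, pair_mi, edges):
    """Smallest chain bound over orders whose consecutive sensors are linked in the graph."""
    bounds = [
        chain_bound(order, single_mi, pair_mi)
        for order in permutations(sorted(subset))
        if all(frozenset(e) in edges for e in zip(order, order[1:]))
    ]
    return min(bounds) if bounds else None  # None: no chain exists on the graph


def select_sensors(sensors, single_mi, pair_mi, edges, max_size):
    # Start from the best pair, found by exhaustive search over pairwise values.
    best_pair = max(pair_mi, key=pair_mi.get)
    selected, current = set(best_pair), pair_mi[best_pair]
    while len(selected) < max_size:
        candidates = []
        for s in sorted(set(sensors) - selected):
            value = lub(selected | {s}, single_mi, pair_mi, edges)
            if value is not None:
                candidates.append((value, s))
        if not candidates:
            break
        value, s = max(candidates)
        if value <= current:
            break  # adding a sensor no longer improves the bound
        selected.add(s)
        current = value
    return selected, current


# Hypothetical mutual-information values in bits and a hypothetical pruned graph.
single_mi = {"S1": 0.30, "S2": 0.25, "S3": 0.10, "S4": 0.05}
pair_mi = {
    frozenset(("S1", "S2")): 0.45, frozenset(("S2", "S3")): 0.32,
    frozenset(("S1", "S3")): 0.38, frozenset(("S1", "S4")): 0.33,
    frozenset(("S2", "S4")): 0.28, frozenset(("S3", "S4")): 0.14,
}
edges = {frozenset(("S1", "S2")), frozenset(("S2", "S3")), frozenset(("S1", "S3"))}
selected, bound = select_sensors(["S1", "S2", "S3", "S4"], single_mi, pair_mi, edges, 4)
```

In this toy run, S4 has no surviving link in the pruned graph, so no chain containing it exists and it is never evaluated; this is how pruning shrinks the search space. Enumerating permutations is only practical for small subsets and is used here for clarity.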

IV.  Algorithm  Evaluation 

Since  the  main  contribution  of  this  paper  is  the  introduction 
of  an  alternative  measure  to  mutual  information  for  efficient 
sensor  selection,  the  experimental  evaluation  should  focus  on 
the  effectiveness  of  this  measure  for  both  sensor  selection  accu¬ 
racy  and  efficiency.  We  want  to  emphasize  that  the  alternative 
measure,  i.e.,  the  LUB  of  mutual  information,  is  an  approxima¬ 
tion  of  the  mutual  information  only  for  the  purpose  of  sensor 
selection.  As  a  result,  the  quality  of  this  approximation  should 


TABLE II

Pseudocode of the Approximation Algorithm to Select a Subset of Sensors

Sensor-Selection(n)
    for each i, j, compute I(Θ; S_i) and I(Θ; S_i, S_j)
    construct a pruned synergy graph G
    choose S_i*, S_j* such that I(Θ; S_i, S_j) is maximized over all i and j
    S ← {S_i*, S_j*}
    while |S| < m
        for each S', where |S'| = |S| + 1, S' is a synergy chain on G, and S ⊂ S'
            find all Markov synergy chains of S'
            lub(Θ; S') ← arg min I(Θ; S'),
                where I(Θ; S') is computed by Eq. (7), and
                min is taken over all Markov synergy chains of S'
        S'* ← arg max lub(Θ; S'),
            where max is taken over all S'
        if lub(Θ; S'*) > lub(Θ; S)
            S ← S'*
        else break
    return S

be  evaluated  against  its  performance  in  sensor  selection.  For 
this,  we  propose  to  measure  how  close  the  sensor  selection 
results  using  the  alternative  measure  are  to  those  based  on 
mutual  information.  The  closeness  between  a  sensor  subset 
selected  using  the  alternative  measure  and  a  sensor  subset 
selected based on mutual information is quantified by the relative difference in mutual information. Based on this criterion, we will experimentally evaluate the proposed method under different BN topologies, different BN model complexities, and different numbers of sensors.

Given  two  different  criteria  (mutual  information  and  its 
LUB)  for  measuring  sensor  gain,  sensor  selection  can  be  car¬ 
ried  out  by  using  different  methods.  We  will  perform  sensor 
selection  using  the  following  methods:  1)  brute-force  method; 
2)  random  method,  which  randomly  chooses  one  sensor  at  a 
time  to  form  a  sensor  ensemble;  and  3)  the  proposed  method. 
These experiments aim to demonstrate the following: 1) The proposed LUB criterion yields near-optimal selections across different methods. 2) Given the same sensor selection criterion, the proposed greedy approach outperforms the random sensor selection method.


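$$
\begin{aligned}
P(\Theta, S_i, S_j)
  &= \sum_{S_1, \ldots, S_N;\, l \ne i,j} P(\Theta) \prod_{k=1}^{K} P(X_k \mid \pi(X_k)) \prod_{m=1}^{M} P(Y_m \mid \pi(Y_m)) \prod_{n=1}^{N} P(S_n \mid \pi(S_n)) \\
  &= P(\Theta) \prod_{k=1}^{K} P(X_k \mid \pi(X_k)) \prod_{m=1}^{M} P(Y_m \mid \pi(Y_m)) \Bigl\{ \sum_{S_1, \ldots, S_N;\, l \ne i,j} \prod_{n=1}^{N} P(S_n \mid \pi(S_n)) \Bigr\}
\end{aligned}
\qquad (12)
$$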
Fig. 9. Generic example of the BN network used for evaluation, where the
top  layer  is  for  the  hypothesis,  and  the  bottom  layer  is  for  the  sensors.  The 
intermediate  layers  are  arbitrarily  and  randomly  connected. 

We  first  compare  the  performance  of  the  proposed  sensor 
selection  method  in  Table  II  with  the  brute-force  method. 
The  brute-force  method  exhaustively  identifies  the  best  sensor 
subset  by  the  exact  mutual  information.  The  study  is  done  by 
using  different  numbers  of  sensors  and  different  BN  topologies. 
Fig.  9  shows  a  generic  example  of  a  BN  used  for  the  evaluation. 

Because the brute-force approach takes exponential time, we limit our test models to at most five layers and at most ten sensors. The exact number of layers, the connections among nodes in the intermediate layers, and the number of sensors are randomly generated so that ten different BNs with different topologies are generated. For each randomly generated BN topology, its parameters are randomly generated ten times to produce ten differently parameterized BNs. This yields a total of 100 test models. Fig. 10 shows two examples of BNs used for this paper.

The  results  averaged  among  100  trials  are  shown  in  Table  III, 
where  the  closeness  is  defined  as  the  relative  difference  in 
mutual  information  between  the  solution  from  our  approach  and 
the  solution  from  the  brute-force  approach.  It  can  be  seen  that 
the  solution  with  our  method  is  close  to  the  sensor  selection 
results using the brute-force method. For further comparison, the run time of the two methods is measured on a 2.0-GHz computer, and the run time averaged among ten trials is summarized in Table III. Compared with the brute-force method, our method significantly reduces the computation time with minimal loss in sensor selection accuracy.
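The closeness measure can be made concrete with a short sketch. It assumes that the relative difference is normalized by the brute-force (optimal) mutual information, which the text does not state explicitly; the function name and the numeric values are hypothetical.

```python
def relative_mi_difference(mi_selected, mi_brute_force):
    """Relative loss in mutual information versus the brute-force optimum.

    Assumption: the difference is normalized by the brute-force value.
    """
    if mi_brute_force <= 0.0:
        raise ValueError("brute-force mutual information must be positive")
    return (mi_brute_force - mi_selected) / mi_brute_force


# Hypothetical mutual-information values in bits, for illustration only.
closeness = relative_mi_difference(0.49, 0.50)
```

A value near zero means the selected subset carries almost as much information about the hypothesis as the exhaustive optimum.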

To  demonstrate  the  improvement  of  the  proposed  method 
over  random  sensor  selection,  the  results  of  random  sensor 
selection  are  also  included  in  Table  III.  For  a  fair  comparison, 
we  first  use  our  method  to  select  a  best  subset  and  then  use  the 
random  method  to  select  a  subset  of  the  same  size  using  the 
same criterion. To account for the random nature of random selection, the results are averaged, and the averaged result is used to compare against the result from our method. Compared with
the  random  sensor  selection,  our  method  shows  a  significant 
improvement  in  sensor  selection  accuracy. 

Finally,  we  want  to  note  that  the  randomly  generated  BN 
topologies  (for  example,  the  BNs  in  Figs.  9  and  10)  may  not 


necessarily  satisfy  the  assumption  needed  for  Theorem  1  to 
hold.  Despite  this,  the  selected  sensors  remain  close  (in  mutual 
information)  to  those  selected  by  the  brute-force  method,  as 
demonstrated  in  Table  III.  We  also  repeat  the  above  experiments 
by  using  naive  BNs.  The  parameters  of  BNs  are  randomly 
generated.  We  selected  k  sensors  from  n  sensors  (k  <  n) 
without  considering  sensor  costs.  The  sensors  selected  by  the 
brute-force  method  and  by  our  approach  have  no  difference. 

V.  Conclusion 

It is computationally difficult to identify an optimal sensor subset with the information-theoretic criterion. To address this problem, we have presented an approximation method to find a near-optimal sensor subset by utilizing the sensor pairwise information to infer the synergy among sensors. Specifically,
this  paper  includes  the  following  aspects:  First,  we  propose  to 
use  a  BN  to  represent  sensors,  their  dependencies,  and  their 
relationships  to  other  latent  variables.  In  addition,  the  built- 
in  conditional  independence  assumptions  with  the  BNs  allow 
factorizing  the  joint  probabilities  so  that  fusion  can  efficiently 
be  performed.  Second,  we  introduce  a  statistical  measure  to 
quantify  the  pairwise  synergy  among  sensors.  Based  on  the 
synergy  measure,  a  synergy  graph  is  constructed,  which  is  used 
to  infer  synergy  among  multiple  sensors,  based  on  which  we 
can  then  eliminate  many  unpromising  sensor  combinations. 
Finally, for the remaining sensor combinations, a greedy approach is introduced to identify the optimal sensor combination
based  on  the  LUB  of  the  joint  mutual  information.  The  use  of 
the  LUB  of  the  joint  mutual  information  instead  of  the  joint 
mutual  information  itself  significantly  reduces  the  computation 
time  with  minimum  loss  in  accuracy.  We  demonstrate  both  the 
optimality  and  the  efficiency  of  the  proposed  method  through 
many  random  simulations  under  different  numbers  of  sensors 
and  different  relationships  among  sensors. 

A major assumption of this paper is that two sensors are conditionally independent of each other, given another sensor between them and the fusion result. This assumption could limit the utility of this paper. As part of the future research, we will study ways to overcome this assumption.
Another  assumption  we  made  in  this  paper  is  that  all  the 
sensors  have  the  same  cost.  Such  an  assumption  is  not  realistic 
for  many  applications.  Overcoming  this  assumption,  however, 
requires  incorporating  the  sensor  cost  into  the  proposed  synergy 
function,  which  is  a  nontrivial  task.  We  will  study  this  issue  in 
the  future  as  well. 

Appendix 

In  the  following,  we  introduce  our  proof  for  Theorems  1  and  2. 
A.  Proof  of  Theorem  1 

Before  proving  Theorem  1,  we  give  the  following  lemma. 

Lemma 1 (Chain Rule of Mutual Information): Letting $X, Y_1, \ldots, Y_m$ be random variables, then

$I(X; Y_1, \ldots, Y_m) = I(X; Y_1) + \sum_{i=2}^{m} I(X; Y_i \mid Y_1, \ldots, Y_{i-1}).$  (13)
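The chain rule can be checked numerically for the smallest non-trivial case, $I(X; Y_1, Y_2) = I(X; Y_1) + I(X; Y_2 \mid Y_1)$. The sketch below uses a randomly generated joint distribution over three binary variables; the helper names `marginal` and `cond_mi` are hypothetical, and the check illustrates the identity rather than proving it.

```python
from itertools import product
from math import log2
import random


def marginal(joint, keep):
    """Marginal distribution over the variable positions listed in keep."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def cond_mi(joint, a, b, c=()):
    """I(A; B | C) in bits, where a, b, c are tuples of variable positions."""
    p_abc = marginal(joint, a + b + c)
    p_ac = marginal(joint, a + c)
    p_bc = marginal(joint, b + c)
    p_c = marginal(joint, c)
    na, nb = len(a), len(b)
    total = 0.0
    for key, p in p_abc.items():
        if p > 0.0:
            ka, kb, kc = key[:na], key[na:na + nb], key[na + nb:]
            total += p * log2(p * p_c[kc] / (p_ac[ka + kc] * p_bc[kb + kc]))
    return total


# Random joint over (X, Y1, Y2), all binary; positions 0, 1, 2.
random.seed(0)
outcomes = list(product((0, 1), repeat=3))
weights = [random.random() for _ in outcomes]
joint = {o: w / sum(weights) for o, w in zip(outcomes, weights)}

lhs = cond_mi(joint, (0,), (1, 2))  # I(X; Y1, Y2)
rhs = cond_mi(joint, (0,), (1,)) + cond_mi(joint, (0,), (2,), (1,))  # I(X;Y1) + I(X;Y2|Y1)
```

The two sides agree up to floating-point rounding for any joint distribution, since the identity is exact.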
Fig. 10. Two specific examples of BN structures with different numbers of sensors used for the evaluation.

TABLE III

Comparison of the Proposed Method and the Brute-Force Method

Number of   Our approach:            Our approach:   Random method:           Brute-force:
sensors     relative MI difference   run time        relative MI difference   run time
            to brute force           (seconds)       to brute force           (seconds)
7           1.56%                    1.020           21.13%                   63.87
8           1.77%                    1.099           28.32%                   355.05
9           2.75%                    1.209           36.54%                   2967.36
10          1.89%                    1.430           39.19%                   13560.54

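The proof of Lemma 1 is straightforward [2]. We now turn to proving Theorem 1.

Proof: Based on Lemma 1, we have

$$
\begin{aligned}
I(\Theta; S_1, \ldots, S_m)
  &= I(\Theta; S_1) + I(\Theta; S_2 \mid S_1) + I(\Theta; S_3 \mid S_1, S_2) \\
  &\quad + I(\Theta; S_4 \mid S_1, S_2, S_3) + \cdots + I(\Theta; S_m \mid S_1, \ldots, S_{m-1}).
\end{aligned}
\qquad (14)
$$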


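We start with an MSC containing four random variables $\{\Theta, S_1, S_2, S_3\}$, then extend it to five variables, and finally to a finite number of arbitrary random variables forming an MSC. Notice that $\Theta$ is the hypothesis, and $S_1, S_2, S_3$ are the sensors.

Based on Definition 6 of MSC, $S_1$ and $S_3$ are conditionally independent given $S_2$, i.e., $P(S_3 \mid S_1, S_2) = P(S_3 \mid S_2)$. First, we prove that the following equation holds when $S_1$ and $S_3$ are conditionally independent given $S_2$:

$I(\Theta; S_3 \mid S_2) = I(\Theta; S_3 \mid S_1, S_2)$  (15)

$$
\begin{aligned}
I(\Theta; S_3 \mid S_1, S_2)
  &= \sum_{\theta, s_1, s_2, s_3} p(\theta, s_1, s_2, s_3) \lg \frac{p(\theta, s_3 \mid s_1, s_2)}{p(\theta \mid s_1, s_2)\, p(s_3 \mid s_1, s_2)} \\
  &= \sum_{\theta, s_1, s_2, s_3} p(\theta, s_1, s_2, s_3) \lg \frac{p(\theta, s_3, s_1, s_2)}{p(\theta, s_1, s_2)\, p(s_3 \mid s_2)} \\
  &= \sum_{\theta, s_1, s_2, s_3} p(\theta, s_1, s_2, s_3) \lg \frac{p(s_3 \mid \theta, s_1, s_2)}{p(s_3 \mid s_2)} \\
  &= \sum_{\theta, s_1, s_2, s_3} p(\theta, s_1, s_2, s_3) \lg \frac{p(s_3 \mid \theta, s_2)}{p(s_3 \mid s_2)} \\
  &= \sum_{\theta, s_2, s_3} p(\theta, s_2, s_3) \lg \frac{p(s_3 \mid \theta, s_2)}{p(s_3 \mid s_2)} \\
  &= \sum_{\theta, s_2, s_3} p(\theta, s_2, s_3) \lg \frac{p(s_3, \theta \mid s_2)}{p(\theta \mid s_2)\, p(s_3 \mid s_2)} \\
  &= I(\Theta; S_3 \mid S_2).
\end{aligned}
\qquad (16)
$$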

Please note that for the derivations in (16), we assume that $p(S_3 \mid \Theta, S_1, S_2) = p(S_3 \mid \Theta, S_2)$, i.e., $S_3$ and $S_1$ are conditionally independent given both $\Theta$ and $S_2$, where $\Theta$ is a random variable representing the fusion result, and $S_i$ is a sensor. The typical relationships between $\Theta$ and $S_i$ are illustrated in Fig. 1, where $\Theta$ is typically the root node, and the $S_i$'s are the leaf nodes in the BN. Given this understanding, if the BN is such that the (undirected) path between two sensor nodes (e.g., $S_1$ and $S_3$) goes through the $\Theta$ node (e.g., the BN in Fig. 1), then, following the D-separation principle for BNs, $p(S_3 \mid \Theta, S_1, S_2) = p(S_3 \mid \Theta, S_2)$ holds. Please note that this assumption only holds for some BNs, such as the one in Fig. 1 and the naive BN. It may not hold for an arbitrary BN.

From the chain rule of mutual information, we have

$I(\Theta; S_3, S_2) = I(\Theta; S_2) + I(\Theta; S_3 \mid S_2).$  (17)

Hence, combining (16) and (17) yields

$I(\Theta; S_3 \mid S_1, S_2) = I(\Theta; S_2, S_3) - I(\Theta; S_2).$  (18)
Now, we want to apply a similar algebraic process to prove $I(\Theta; S_4 \mid S_1, S_2, S_3) = I(\Theta; S_4 \mid S_3)$ in (19), given the Markov conditions that $P(S_4 \mid S_2, S_3) = P(S_4 \mid S_3)$, $P(S_1 \mid S_3, S_2) = P(S_1 \mid S_2)$, $P(S_4 \mid S_1, S_2) = P(S_4 \mid S_2)$, and $P(S_4 \mid S_1, S_3) = P(S_4 \mid S_3)$. By the mutual information chain rule, we have $I(\Theta; S_3, S_4) = I(\Theta; S_3) + I(\Theta; S_4 \mid S_3)$, i.e.,

$I(\Theta; S_4 \mid S_3) = I(\Theta; S_3, S_4) - I(\Theta; S_3).$  (20)

Combining (19) and (20) produces

$I(\Theta; S_4 \mid S_1, S_2, S_3) = I(\Theta; S_4 \mid S_3) = I(\Theta; S_3, S_4) - I(\Theta; S_3).$  (21)

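Finally, we can generalize the above process to prove

$I(\Theta; S_m \mid S_1, S_2, \ldots, S_{m-1}) = I(\Theta; S_{m-1}, S_m) - I(\Theta; S_{m-1}).$  (22)

Substituting the results in (18), (21), and (22) into (14) yields

$$
\begin{aligned}
I(\Theta; S_1, S_2, \ldots, S_m)
  &= I(\Theta; S_1) + I(\Theta; S_2 \mid S_1) + I(\Theta; S_3, S_2) - I(\Theta; S_2) \\
  &\quad + I(\Theta; S_3, S_4) - I(\Theta; S_3) + \cdots + I(\Theta; S_{m-1}, S_m) - I(\Theta; S_{m-1}) \\
  &= I(\Theta; S_1) + I(\Theta; S_2, S_1) - I(\Theta; S_1) + I(\Theta; S_3, S_2) - I(\Theta; S_2) \\
  &\quad + I(\Theta; S_3, S_4) - I(\Theta; S_3) + \cdots + I(\Theta; S_{m-1}, S_m) - I(\Theta; S_{m-1}) \\
  &= I(\Theta; S_1) + \sum_{i=2}^{m} \bigl[ I(\Theta; S_{i-1}, S_i) - I(\Theta; S_{i-1}) \bigr].
\end{aligned}
\qquad (23)
$$

This completes the proof for Theorem 1. ■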


B.  Proof  of  Theorem  2 

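Proof: We want to prove

$I(\Theta; S_1, \ldots, S_m) \le I^M(\Theta; S_1, \ldots, S_m).$  (24)

From the mutual information chain rule, we have

$$
\begin{aligned}
I(\Theta; S_1, S_2, \ldots, S_m)
  &= I(\Theta; S_1) + I(\Theta; S_2 \mid S_1) + I(\Theta; S_3 \mid S_1, S_2) \\
  &\quad + I(\Theta; S_4 \mid S_1, S_2, S_3) + \cdots + I(\Theta; S_m \mid S_1, S_2, \ldots, S_{m-1}).
\end{aligned}
\qquad (25)
$$

By Theorem 1, we have

$$
\begin{aligned}
I^M(\Theta; S_1, \ldots, S_m)
  &= I(\Theta; S_1) + I(\Theta; S_2 \mid S_1) + I(\Theta; S_3 \mid S_2) \\
  &\quad + I(\Theta; S_4 \mid S_3) + \cdots + I(\Theta; S_m \mid S_{m-1}).
\end{aligned}
\qquad (26)
$$

By the definition of mutual information, we have

$I(\Theta; S_3; S_2; S_1) = I(\Theta; S_3; S_2) - I(\Theta; S_3; S_2 \mid S_1)$  (27)

which readily leads to

$I(\Theta; S_3; S_2 \mid S_1) = I(\Theta; S_3 \mid S_2) - I(\Theta; S_3 \mid S_2, S_1).$  (28)

Hence

$I(\Theta; S_3 \mid S_2, S_1) = I(\Theta; S_3 \mid S_2) - I(\Theta; S_3; S_2 \mid S_1).$  (29)

Therefore

$I(\Theta; S_3 \mid S_2) \ge I(\Theta; S_3 \mid S_1, S_2).$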

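Please note that we assume here that $I(\Theta; S_3; S_2 \mid S_1) \ge 0$, which is correct since for our application $\Theta$ (the hypothesis) and the other variables (sensors) are not independent of each other.

$$
\begin{aligned}
I(\Theta; S_4 \mid S_1, S_2, S_3)
  &= \sum_{\theta, s_1, s_2, s_3, s_4} p(\theta, s_1, s_2, s_3, s_4) \lg \frac{p(\theta, s_4 \mid s_1, s_2, s_3)}{p(\theta \mid s_1, s_2, s_3)\, p(s_4 \mid s_1, s_2, s_3)} \\
  &= \sum_{\theta, s_1, s_2, s_3, s_4} p(\theta, s_1, s_2, s_3, s_4) \lg \frac{p(\theta, s_4, s_1, s_2, s_3)}{p(\theta, s_1, s_2, s_3)\, p(s_4 \mid s_3)} \\
  &= \sum_{\theta, s_1, s_2, s_3, s_4} p(\theta, s_1, s_2, s_3, s_4) \lg \frac{p(s_4 \mid \theta, s_1, s_2, s_3)}{p(s_4 \mid s_3)} \\
  &= \sum_{\theta, s_3, s_4} p(\theta, s_3, s_4) \lg \frac{p(s_4 \mid \theta, s_3)}{p(s_4 \mid s_3)} \\
  &= \sum_{\theta, s_3, s_4} p(\theta, s_3, s_4) \lg \frac{p(s_4, \theta \mid s_3)}{p(\theta \mid s_3)\, p(s_4 \mid s_3)} \\
  &= I(S_4; \Theta \mid S_3) = I(\Theta; S_4 \mid S_3).
\end{aligned}
\qquad (19)
$$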

Similarly, we have $I(\Theta; S_4 \mid S_3) \ge I(\Theta; S_4 \mid S_1, S_2, S_3)$ and

$I(\Theta; S_m \mid S_{m-1}) \ge I(\Theta; S_m \mid S_1, S_2, \ldots, S_{m-1}).$

Hence, (26) $\ge$ (25). The equality sign holds when the Markov property between neighbor sensors is true.

Hence, this completes the proof for Theorem 2. ■
Abstract - Active learning methods seek to reduce the number of labeled instances needed to train an effective classifier. Most current methods are myopic, i.e., they select a single unlabeled sample to label at a time. Batch-mode active learning methods, on the other hand, typically select the top N unlabeled samples with maximum score. Such selected samples often cannot guarantee the learner's performance. In this paper, a non-myopic active learning algorithm based on mutual information is presented. Our algorithm selects a set of samples at each iteration, and the objective function of the algorithm is proved to be submodular, which guarantees that a near-optimal solution is found. Our experimental results on UCI data sets show that the proposed algorithm outperforms myopic active learning.

Index  Terms  -  Non-myopic  active  learning,  Mutual 
information,  Submodular  function. 


I.  Introduction 

Most current studies in active learning have focused on selecting a single unlabeled sample in each iteration. In this case, the classifier has to be retrained after each selected sample is labeled. These myopic approaches are also not suited to a parallel labeling environment. By contrast, batch-mode active learning methods select a set of unlabeled samples at a time. They provide a solution for multiple annotators and for models with slow training procedures. A simple strategy for selecting a query batch is to myopically query the top N queries according to a given query strategy. Methods based on such a strategy do not work well, since they greedily select the locally "best" unlabeled samples at each iteration and fail to consider the overlap in information content among the "best" instances [1].

To  address  this  issue,  a  few  improved  batch-mode  active 
learning  algorithms  have  been  proposed.  Brinker  [2]  considers 
an  approach  for  SVMs  that  explicitly  incorporates  diversity 
among  instances  in  the  batch.  Xu  et  al.  [3]  propose  a  similar 
approach  for  SVM  active  learning,  which  also  incorporates  a 
density  measure.  Specifically,  they  query  cluster  centroids  for 
instances  that  lie  close  to  the  decision  boundary.  Hoi  et  al.  [4] 
extend  the  Fisher  information  framework  to  the  batch-mode 
setting  for  binary  logistic  regression.  But  most  of  these 
approaches  still  use  greedy  heuristics  to  ensure  that  instances 
in  the  batch  are  both  diverse  and  informative  without  a 
guarantee  of  near-optimal  solution. 

In  this  paper,  we  present  a  non-myopic  active  learning 
algorithm  to  query  instances  in  groups.  In  active  learning  one 
solution  to  decide  which  sample  to  label  is  based  on  its  effect 
on  the  remaining  unlabeled  data.  So,  our  aim  is  to  select  a  set 


of  unlabeled  samples  which,  if  labeled,  can  maximize  the 
confidence  (certainty)  on  remaining  unlabeled  data.  Based  on 
this  idea,  we  exploit  mutual  information  to  construct  the 
objective  function. 

The  objective  function  of  our  algorithm  is  proved  to  be 
submodular,  which  guarantees  the  greedy  method  can  find 
near-optimal  sets  to  be  added  into  training  data  so  that  active 
learner  can  improve  in  both  accuracy  and  efficiency. 

This  paper  is  structured  as  follows.  Section  II  gives  the 
description  of  the  objective  function  of  our  non-myopic  active 
learning  method  and  the  proof  of  submodularity  of  the 
objective  function.  Section  III  presents  our  algorithm.  Section 
IV contains experimental results on the Chess database and the Tic-Tac-Toe Endgame database from the UCI Machine Learning Repository. Section V summarizes the contributions made in this paper.

II.  The  objective  function  and  its  submodularity 

The aim of our non-myopic active learning is to select a subset of $K$ unlabeled samples in each iteration, such that when the samples are given true labels and added to the training set $D$, the learner trained on the augmented training set achieves the maximum classification uncertainty reduction on the test set. So the objective function is as follows.

$A^* = \arg\max_{A \subseteq U} R(A), \quad \text{s.t. } |A| \le K,$  (1)

where $A$ is a subset of unlabeled training data, and $U$ is the unlabeled data pool. For a set $A$, the uncertainty reduction function $R(A)$, i.e., the mutual information criterion between $A$ and $U$, can be defined as

$R(A) = H_D(U) - H_D(U \setminus A \mid A),$  (2)

where $H_D(\cdot)$ is the entropy for measuring the classification uncertainty on the remaining unlabeled data given a current learner. Assuming the initial small set of labeled data $D$, a large pool of unlabeled data $U = \{x_1, x_2, \ldots, x_M\}$, and $x_i \in U$, $H_D(\cdot)$ can be computed by


Hd(U) 
Let $A = \{x_1, \ldots, x_m\}$ be a subset of $U$ and $x_k \in U \setminus A$. $H_D(U \setminus A \mid A)$ measures the expected uncertainty on the remaining data set $U \setminus A$ after $A$ is selected. Since the true labels $y_1, \ldots, y_m$ for each of $x_1, \ldots, x_m$ are unknown before we make the query for set $A$, the conditional entropy $H_D(U \setminus A \mid A)$ can be written as

$H_D(U \setminus A \mid A) = \frac{1}{|U \setminus A|} \sum_{x_k \in U \setminus A} \sum_{y_1 \in Y} \cdots \sum_{y_m \in Y} (\cdots)$  (4)

In Equation 4, $P_D(y_k \mid x_1, y_1, \ldots, x_m, y_m, x_k)$ is the estimated output distribution given $D$, $x_k$ and some possible labels $y_1, \ldots, y_m$ for $x_1, \ldots, x_m$, i.e.,

$P_D(y_k \mid x_1, y_1, \ldots, x_m, y_m, x_k) = P_{D \cup \{(x_1, y_1), \ldots, (x_m, y_m)\}}(y_k \mid x_k).$  (5)

It was shown in [5] that, if the objective function $F$ is submodular, then a greedy algorithm, which starts with the empty set $A = \emptyset$ and iteratively adds to $A$ the element $s^* = \arg\max_{s} F(A \cup \{s\})$ until $k$ elements have been selected, finds a near-optimal set. This result uses the concept of submodularity, i.e., an intuitive diminishing returns property: a new observation decreases our uncertainty more if we know less. Based on this result, we give a proof of the submodularity of $R$ as follows.

Proof of the submodularity of $R$:

Let $A \subseteq B \subseteq U$, and let $X \notin A$ be unselected samples. If $R$ is submodular, then the "diminishing returns" property must hold:

$R(A \cup X) - R(A) \ge R(B \cup X) - R(B).$  (6)

Since $H_D(\cdot)$ is the entropy of the learner's posterior, it satisfies the 'information never hurts' principle [6],

$H_D(X \mid A) \le H_D(X),$  (7)

i.e., adding $A$ with true labels cannot increase the entropy [7]. The marginal increase of $R$ can be written as

$$
\begin{aligned}
R(A \cup X) - R(A)
  &= H_D(U) - H_D(U \setminus (A \cup X) \mid A \cup X) - H_D(U) + H_D(U \setminus A \mid A) \\
  &= H_D(U \setminus A \mid A) - H_D(U \setminus (A \cup X) \mid A \cup X) \\
  &= H_D(U) - H_D(A) - H_D(U) + H_D(A \cup X) \\
  &= H_D(A \cup X) - H_D(A) \\
  &= H_D(X \mid A).
\end{aligned}
\qquad (8)
$$

In Equation 8, we used the chain rule of the joint entropy, i.e.,

$H(Y \mid X) = H(X, Y) - H(X).$  (9)

Submodularity is simply a consequence of the 'information never hurts' principle:

$R(A \cup X) - R(A) = H_D(X \mid A) \ge H_D(X \mid B) = R(B \cup X) - R(B).$  (10)

This proves that $R$ is submodular.
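The inequality $H(X \mid A) \ge H(X \mid B)$ for $A \subseteq B$, on which (10) rests, can be illustrated numerically. The sketch below treats the entropy as ordinary Shannon entropy of a randomly generated joint distribution over three binary variables, which is an assumption made only for illustration; the helper names are hypothetical, and the conditional entropy is computed with the chain rule (9).

```python
from itertools import product
from math import log2
import random


def marginal(joint, keep):
    """Marginal distribution over the variable positions listed in keep."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in dist.values() if p > 0.0)


def cond_entropy(joint, x, given):
    """H(X | given) = H(X, given) - H(given), as in (9)."""
    return entropy(marginal(joint, x + given)) - entropy(marginal(joint, given))


# Positions: 0 = X, 1 = a, 2 = b. Random joint for illustration only.
random.seed(1)
outcomes = list(product((0, 1), repeat=3))
weights = [random.random() for _ in outcomes]
joint = {o: w / sum(weights) for o, w in zip(outcomes, weights)}

h_x = cond_entropy(joint, (0,), ())              # H(X)
h_x_given_a = cond_entropy(joint, (0,), (1,))    # H(X | A), A = {a}
h_x_given_b = cond_entropy(joint, (0,), (1, 2))  # H(X | B), B = {a, b}
```

The values are non-increasing as the conditioning set grows, which is the diminishing-returns behavior the proof relies on.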

III.  A  NON-MYOPIC  ACTIVE  LEARNING  ALGORITHM 

The problem of maximizing a submodular function is NP-hard. The greedy algorithm is an efficient alternative that reduces the computation and achieves near-optimal results if the submodularity property holds. We use the greedy algorithm to iteratively find each sample for a batch $A$. In the step of greedily finding $A$, sample selection stops when the current $H_D(U \setminus A \mid A)$ is larger than the previous one or when the number of selected samples reaches $K$.

After selecting a set $A$, the algorithm adds $A$ with true labels into the labeled training data. The learner is retrained on the new training set. If the prediction accuracy of the current learner is not satisfactory, the process is repeated.

Algorithm:

Initialization: Randomly select a small set $D$ from the unlabeled sample pool, assign a class label to each sample, and construct an initial training set. Train the classifier $C$ using $D$.
While the stopping criterion (here, prediction accuracy) is not satisfied

1. Greedily find A;

2. Add $A$ with true labels to $D$ to form $D'$;

3. Retrain the classifier $C$ from $D'$, and obtain its prediction accuracy on the test data set.

Because $H_D(U)$ is constant while iteratively adding $x_m$ to $A$, we can pick the unselected sample $x_m$ that minimizes $H_D(U \setminus A \mid A)$ in the process of finding $A$. The process of Greedily find A is as follows.


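Greedily find A:

While $|A| < K$
Do
    $x_m^* = \arg\min_{x_m \in U \setminus A} H_D(U \setminus (A \cup \{x_m\}) \mid A \cup \{x_m\})$;
    If $H_D(U \setminus (A \cup \{x_m^*\}) \mid A \cup \{x_m^*\}) < H_D(U \setminus A \mid A)$
        $A = A \cup \{x_m^*\}$;
    Else
        Break;
End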
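The greedy step above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the callable `remaining_entropy` is a hypothetical stand-in for $H_D(U \setminus A \mid A)$, which in the paper would be computed from the learner's posterior, and the name `greedy_batch` is introduced here only for the example.

```python
from typing import Callable, FrozenSet, Hashable, List, Set


def greedy_batch(
    pool: List[Hashable],
    remaining_entropy: Callable[[FrozenSet[Hashable]], float],
    k: int,
) -> Set[Hashable]:
    """Greedily build a batch A of at most k samples.

    remaining_entropy(A) is assumed to return the uncertainty left on the
    unselected samples once A is labeled (hypothetical interface).
    """
    selected: Set[Hashable] = set()
    current = remaining_entropy(frozenset(selected))
    while len(selected) < k:
        candidates = [x for x in pool if x not in selected]
        if not candidates:
            break
        # Candidate whose addition leaves the least remaining entropy.
        best = min(
            candidates,
            key=lambda x: remaining_entropy(frozenset(selected | {x})),
        )
        best_value = remaining_entropy(frozenset(selected | {best}))
        if best_value < current:
            selected.add(best)
            current = best_value
        else:
            # No candidate strictly reduces the entropy: stop early.
            break
    return selected
```

Each iteration evaluates the set function once per remaining candidate, so a batch of size K costs on the order of K times the pool size evaluations, rather than one evaluation per possible subset.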

IV.  Experimental  results 

Two benchmark data sets from the UCI Machine Learning Repository are used to evaluate the performance of our algorithm. Both are binary classification tasks. We choose the TAN classifier as the classification algorithm. In the step of greedily finding A, we let K = 3. To evaluate our approach (non-myopic), we compare its results with those of the expected log loss reduction algorithm for myopic active learning and the “N-best” batch method.


Fig. 1 Comparison results on the Chess data set (horizontal axis: number of selected samples).

The first data set is from the chess (King-Rook vs King-Pawn) database. The data set is randomly partitioned into a training set of 75 instances, comprising 22 initially labeled examples and 53 unlabeled examples. The independent test set consists of 85 instances.

Figure 1 shows the resulting accuracy of the three algorithms as a function of the number of selected samples. The maximum possible accuracy is 98% after all the unlabeled data has been labeled. It can be seen that after a few queries (6 queries) our algorithm (non-myopic) achieves a higher accuracy than the myopic active learning and N-best methods. After 8 iterations our non-myopic active learning reaches 98% accuracy, with the classifier retrained eight times. In contrast, myopic active learning needs to retrain the classifier 23 times to reach 97% accuracy. On this data set, our non-myopic active learning is more efficient and accurate than the myopic active learning and N-best methods.

The second data set for our experiment is from the Tic-Tac-Toe database. The data set is randomly partitioned into a training set of 203 instances, comprising 59 labeled examples and 144 unlabeled examples, and an independent test set of 172 instances.


Fig. 2 Comparison results on the Tic-Tac-Toe data set.


The comparison results are given in Figure 2. The maximum possible accuracy is 97% after all the queries. After 10 iterations, our algorithm reaches 97% accuracy, whereas myopic active learning needs to retrain the classifier 39 times to reach 97% accuracy. Again, the results show that the non-myopic active learning approach outperforms the myopic active learning approach and the N-best method.

V.  Conclusion 

Most current methods for active learning myopically select the locally ‘best’ sample or ‘N-best’ samples to label. These methods cannot achieve near-optimal results and do not work well. By contrast, non-myopic active learning can query samples in groups at each iteration. If the objective function is submodular, then it can find near-optimal sets and performs better than myopic active learning. It is also efficient in training the model and suited to parallel labeling environments. In this paper we proposed a non-myopic active learning method based on mutual information and proved that its objective function is submodular. The experimental results show that it outperforms myopic active learning.
Abstract - Active learning methods seek to reduce the number of labeled instances needed to train an effective classifier. Most current methods assume the availability of a reasonable amount of initially labeled training data so that the learners can be trained with sufficient quality. However, for many applications, the amount of initial training data is often limited. This affects the quality of the initial learners, which, in turn, affects the performance of the active learning methods. In this paper, we introduce a new non-parametric active learning strategy that can perform well even with very limited initial training data. Our method selects the query instance that simultaneously maximizes its label uncertainty and the classification accuracy on the unlabelled test data. Our method hence avoids selecting outliers and does not require a good initial learner. The experimental results with benchmark datasets show that our method outperforms state-of-the-art methods, especially when the amount of initially labeled data is small or when the quality of the initially labeled data is poor.

Index  Terms  -  Active  learning.  Minimal  total  entropy 
reduction.  Limited  initial  labeled  data. 

I.  Introduction 

Active  learning  aims  to  achieve  greater  accuracy  with 
fewer  labeled  training  instances  by  selecting  the  most 
informative  data  to  label  [1].  Active  learning  selects  unlabeled 
examples  for  labeling  if  the  predicted  label  is  highly  uncertain. 
Based  on  this  view,  some  existing  works  in  active  learning 
have  concentrated  on  two  approaches:  Uncertainty-based 
method  [3],  [4],  [5]  and  committee-based  method  [6].  The 
former  estimates  sample  uncertainty  using  one  classifier  while 
the  latter  does  so  by  a  committee  of  classifiers.  Both 
approaches  examine  each  unlabeled  sample  one  at  a  time, 
often  independent  of  the  remaining  unlabeled  samples. 
Although  both  approaches  can  get  to  the  classification 
boundary  fast,  they  are  also  susceptible  to  outliers  since 
outliers  are  also  often  uncertain.  In  fact,  studies  show  that  both 
uncertainty  sampling  and  the  committee-based  methods  often 
fail  by  selecting  outliers  [7],  [8]. 

To alleviate this problem, one solution is to decide which sample to label based not only on the properties of the selected sample but also on its effect on the remaining unlabeled data. From this perspective, if the instance to be labeled maximizes the confidence (certainty), i.e. lets the current learner achieve a good generalization error over the unlabeled data, then adding the instance with its true label to the labeled data set can improve the performance of the classifier. Based on this idea, Roy and McCallum first proposed the estimated log loss reduction (ELLR) framework for text classification using naive Bayes [7]. Zhu et al. combined this framework with a semi-supervised learning approach, resulting in a dramatic improvement over random or uncertainty sampling [8]. Guo and Greiner employ an “optimistic” variant [9], but their formulation is, in fact, equivalent to minimizing the expected future log loss. The ELLR framework has the dual advantage of being near optimal and not dependent on the model class. While effectively addressing the outlier issue, these methods suffer from one major limitation: they typically require a reasonably good initial learner, since they use the $P(y \mid x_m, D)$ determined by the labeled data to weight the log loss. This makes the performance of the estimated error reduction depend further on the quality of the initial model [10]. These methods perform poorly when only a very small amount of initial labeled training data is available.

To  overcome  the  limitations  of  the  above  methods,  we 
propose  to  combine  the  uncertainty  based  method  with  the 
generalization  error  method.  The  selected  sample  by  our 
method  considers  both  its  own  uncertainty  and  its  potential  to 
reduce  the  classification  uncertainties  on  unlabelled  data.  In 
our  method,  the  instance  to  be  labeled  possesses  high 
uncertainty  on  its  label  and  at  the  same  time  should  maximize 
the  classification  confidence  (certainty)  on  unlabelled  data. 
The  uncertainty  of  each  sample  can  be  characterized  by  its 
entropy  and  its  effect  on  the  reduction  of  the  uncertainty  of  the 
unlabeled  data  can  be  characterized  by  its  Minimal  Total 
Entropy  Reduction  (MTER). 

The main benefit of our algorithm, compared with the ELLR used by Roy et al. [7], is that it decreases the influence of using the current learner to approximate the label distribution for weighting the log loss, which is very sensitive to the number of labeled training samples and their quality. In addition, the minimum total entropy criterion is more conservative and strict than the log loss, therefore leading to improved performance. A similar active learning method, Mm+M, is proposed in [2]; it employs the ‘most uncertainty’ approach to change the selection rule based on minimum total entropy when it encounters an unexpected label. In fact, this method still selects a sample by using either the generalization error method or the uncertainty-based method, but not both criteria at the same time. Experiments show that our active learning method can quickly achieve considerable accuracy with fewer labeled samples than state-of-the-art methods such as ELLR, uncertainty-based sampling, and Mm+M, in particular when the quality of the initial labeled data is poor or the number of initial training data is small. In the sections that follow, we describe our method in the pool-based active learning setting and evaluate its performance.

II.  Active  Learning  with  Pool-Based  Sample 

Pool-based active learning is an interactive learning technique designed to reduce the labour cost of labeling, in which the learning algorithm can freely assign unlabeled data to the training set. Active learning starts from an initial set of labeled examples and lets the learner iteratively update its training set while learning at each step from the new knowledge provided by the newly labeled examples. An overview of pool-based active learning can be seen in Figure 1. The classifier $C(\cdot)$ is trained on the labeled examples $L$, and then a selection function $S_f(\cdot)$ selects the most appropriate examples $S$ from an unlabeled data pool $U$ given the knowledge already acquired by the learner.


III.  A  New  Query  Strategy 

Our active learning approach is based on actively identifying and annotating the sample that has maximum uncertainty on its own label and, at the same time, yields the maximum certainty on the class labels of the remaining unlabeled data. Pool-based active learning typically includes a small set of labeled data $L = \{(x_1, y_1), \ldots, (x_k, y_k)\}$ and a large pool of unlabeled data $U = \{x_{k+1}, \ldots, x_m, \ldots, x_N\}$. Assume we select $x_m$ from U as the next instance for human labelling. Our objective function is

$$x_m^* = \arg\max_{x_m \in U} M(x_m),$$  (1)


where $M(x_m)$ is the sum of two terms: the generalization error reduction on the unlabeled data and the label uncertainty of the selected sample. It is defined as

$$M(x_m) = R(x_m) + H(Y_m \mid x_m, L).$$  (2)

In Equation 2, $R(x_m)$ measures the error reduction of a candidate sample $x_m$ over the remaining unlabeled samples, and it can be defined as

$$R(x_m) = E(Y_U \mid X_U, L) - E(Y_U \mid X_U, x_m, y_m, L).$$  (3)

$E(\cdot)$ is a generalization error function. Since the first term in Equation 3 does not depend on the selected instance $x_m$, it can be dropped without changing the maximizer, so we can rewrite Equation 3 as

$$R(x_m) = -E(Y_U \mid X_U, x_m, y_m, L),$$  (4)

and Equation 2 as

$$M(x_m) = -E(Y_U \mid X_U, x_m, y_m, L) + H(Y_m \mid x_m, L).$$  (5)

First, to evaluate the classification performance on the unlabeled data, i.e. the generalization error, the learner is initially trained on the labeled data L. Once trained, given an input $x_n$ from U, it produces a probability distribution on its label $y_n$, based on which $y_n$ can be determined. Given $x_m$, its label $y_m$ and the existing labeled data, we can construct a classifier that estimates the probability distribution of the label $y_n$ for each unlabeled sample $x_n$, i.e.,

$$p_{n,m} = p(y_n \mid (x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k), x_m, y_m, x_n),$$  (6)

where $k < m, n \le N$ and $n \ne m$. The error (uncertainty) of the estimated label $y_n$ for input $x_n$ can be characterized by its entropy

$$-\sum_{y_n \in Y} p_{n,m} \log p_{n,m},$$  (7)

where Y is the set of all possible outcome labels. Then, the total entropy for all remaining unlabeled data in the pool given $(x_m, y_m)$ can be computed as

$$E(Y_U \mid X_U, x_m, y_m, L) = \sum_{\substack{n = k+1 \\ n \ne m}}^{N} \Big( -\sum_{y_n \in Y} p_{n,m} \log p_{n,m} \Big).$$  (8)


Thus one criterion of our active learning approach is to select a query $x_m$ such that, when the query is given its true label $y_m$ and added to the training set, the learner trained on the resulting set $(L + (x_m, y_m))$ yields the maximum reduction in the uncertainty of the labels of the remaining unlabeled samples in the pool, i.e. the smallest entropy for all remaining unlabeled data. However, before we make the query, $y_m$, the true label for $x_m$, is unknown. Thus, Equation 8 is approximated by computing the minimum estimated uncertainty for $x_m$ over each possible label $y_m$, i.e.

$$E(Y_U \mid X_U, x_m, y_m, L) = \min_{y_m \in Y} \sum_{\substack{n = k+1 \\ n \ne m}}^{N} \sum_{y_n \in Y} \left( -p_{n,m} \log p_{n,m} \right).$$  (9)


The selected sample $x_m$ is the next one to be labeled, and it is added to L. The learner is then retrained on L, and the process repeats until the stopping criterion is satisfied. Equation 9 is the minimum total entropy, which is different from the expected entropy used in [7]. Computing the expected entropy requires the current learner to estimate the posterior $P(y_m \mid x_m, L)$ for a candidate in order to compute the weight for each label. When the number of initial labeled samples is small, $P(y_m \mid x_m, L)$ cannot be estimated accurately, and the expected log loss reduction computed in this case will not be accurate either. In contrast, $P(y_m \mid x_m, L)$ is not needed to evaluate the generalization error in our method; our query strategy is hence less dependent on both the quality and the quantity of the initial labeled data set.

On the other hand, the selected sample should possess high uncertainty on its own label. Hence, the second criterion in Equation 5 can be computed as the entropy based on the current learner, i.e.,

$$H(Y_m \mid x_m, L) = -\sum_{y_m \in Y} P(y_m \mid x_m, L) \log P(y_m \mid x_m, L).$$  (10)


Our  objective  function  considers  both  the  uncertainty  of  a 
candidate  and  its  potential  to  reduce  the  uncertainties  of  the 
unlabelled  data.  So  it  selects  the  unlabeled  sample  with  the 
maximum  uncertainty  and  maximum  classification  error 
reduction  on  the  unlabelled  data.  This  therefore  leads  to 
improved  performance  under  very  limited  initial  labeled  data 
as  will  be  demonstrated  in  our  experiments.  The  algorithm  is 
summarized  in  Table  I. 
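One selection step of the query strategy described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: `train` is a hypothetical callable that takes a list of (sample, label) pairs and returns a function mapping a sample to a dictionary of label probabilities; in the paper this role is played by the TAN classifier. The names `min_total_entropy` and `un_mter_select` are introduced here for the example.

```python
import math
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Proba = Dict[Hashable, float]
Labeled = List[Tuple[Hashable, Hashable]]
# Hypothetical interface: train(labeled) -> predict, with predict(x) -> {label: prob}.
Trainer = Callable[[Labeled], Callable[[Hashable], Proba]]


def entropy(dist: Proba) -> float:
    """Entropy of a label distribution (natural logarithm)."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def min_total_entropy(train: Trainer, labeled: Labeled, pool: List[Hashable],
                      x_m: Hashable, labels: Sequence[Hashable]) -> float:
    """Minimum over candidate labels y_m of the total entropy on the rest of the pool."""
    rest = [x for x in pool if x != x_m]
    totals = []
    for y_m in labels:
        # Retrain with the candidate and one hypothesized label.
        predict = train(labeled + [(x_m, y_m)])
        totals.append(sum(entropy(predict(x_n)) for x_n in rest))
    return min(totals)


def un_mter_select(train: Trainer, labeled: Labeled, pool: List[Hashable],
                   labels: Sequence[Hashable]) -> Hashable:
    """Return the pool sample maximizing M(x_m) = -E(...) + H(Y_m | x_m, L)."""
    predict = train(labeled)  # current learner, used for the candidate's own entropy

    def score(x_m: Hashable) -> float:
        return -min_total_entropy(train, labeled, pool, x_m, labels) + entropy(predict(x_m))

    return max(pool, key=score)
```

The sketch retrains the classifier once per candidate and per label, which makes the cost of a single query grow with the pool size times the number of labels.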

The above algorithm is computationally intensive. Several methods can be used to improve its efficiency, including sampling and clustering the unlabelled data pool so that only seeds are considered for labeling. Another way to speed up the algorithm is to use an incremental learning mechanism to update the classifier as labeled data is gradually added. As our focus is on the learning method, we will not discuss this issue in this paper.

IV.  Experimental  Results 

Two benchmark data sets from the UCI Machine Learning Repository are used to evaluate the performance of our method. One is a binary classification task and the other is a multi-class classification task. We choose the TAN classifier as the classification algorithm. To evaluate our approach (UN-MTER), we compare its results with those of the expected log loss reduction algorithm (ELLR) in [7], Mm+M in [2], and the most uncertainty method.


TABLE I

UN-MTER Algorithm

1: Initialization: Randomly select a small set of samples from the unlabeled sample pool U, assign a class label to each of them, and construct an initial training set L. Train the classifier C using L.

2: While the stopping criterion (here, prediction accuracy) is not satisfied

3: Compute $E(Y_U \mid X_U, x_m, y_m, L)$ for each $x_m$ from U using Equation 9;

4: Compute $H(Y_m \mid x_m, L)$ for each $x_m$ from U using Equation 10;

5: Compute $M(x_m)$ for each $x_m$ from U using Equation 5;

6: Select $x_m^*$ with the maximum $M(x_m)$;

7: Add $x_m^*$ with true label $y_m^*$ to L to form $L^+$, where $L^+ = L + (x_m^*, y_m^*)$;

8: Retrain classifier C from $L^+$, and obtain the prediction accuracy.


The first data set is from the Tic-Tac-Toe Endgame database, which consists of 958 instances. The data set is randomly partitioned into a training set of 203 instances, comprising 10 initially labeled examples and 193 unlabeled examples. The independent test set consists of 172 instances.

Figure 2 shows the resulting accuracy of the four algorithms as a function of the number of selected samples. The maximum possible accuracy is 97% after all the unlabeled data has been labeled. In this experiment, each active learner sequentially selects 45 instances from the unlabeled pool and adds them to the labeled set. It can be seen that after 21 queries our algorithm (UN-MTER) and Mm+M reach 65%. In contrast, ELLR reaches 54%. After 23 queries our algorithm consistently achieves higher accuracy than Mm+M and ELLR. The most uncertainty method (UN) has the worst performance. The result demonstrates that our approach outperforms Mm+M, ELLR and the most uncertainty method when the initial labeled data set is very limited. However, ELLR can match UN-MTER once the labeled data set contains a substantial number of samples. Nevertheless, our method is practically useful for application domains in which the availability of initially labeled data is restricted.


[Figure 2: Tic-Tac-Toe data set]


The second data set for our experiment is from the Nursery database. The data set is randomly partitioned into a training set of 182 instances, comprising 16 labeled examples and 166 unlabeled examples, and an independent test set of 282 instances. This data set is a multi-class classification task with 5 classes.


[Figure 3: Nursery data set]


The comparison results are given in Figure 3. The maximum possible accuracy is 66% after all the queries. After 20 queries, our algorithm reaches 51%. ELLR and Mm+M, on the other hand, reach only 35% and 33%, respectively. From that point on, UN-MTER improves faster than the other methods. Again, the results show that our approach outperforms ELLR, Mm+M and UN when the number of initial labeled data is very limited.


Most current methods for active learning assume the availability of a reasonable amount of initially labeled training data of sufficient quality. However, in many applications, the amount and the quality of initial training data are often limited. This affects the quality of the initial learners, which, in turn, affects the performance of the active learning methods. To address this issue, we introduce a method based on maximizing the minimum total entropy reduction on the unlabeled data together with the uncertainty of a sample under the current learner. The experimental results with benchmark datasets show that our method outperforms state-of-the-art methods, especially when the amount of initially labeled data is small or when the quality of the initially labeled data is poor.
Abstract

This paper describes a new approach to unify constraints on parameters with training data to perform parameter estimation in Bayesian networks of known structure. The method is general in the sense that any convex constraint is allowed, which includes many proposals in the literature. Driven by a maximum entropy criterion and the Imprecise Dirichlet Model, we present a constrained convex optimization formulation to combine priors, constraints and data. Experiments indicate benefits of this framework.


1  Introduction 

Bayesian Networks (BNs) encode a joint probability distribution for a set of random variables in a compact graph structure. The problem of parameter learning concerns the estimation of probability measures of conditional probability distributions, given the graph structure of the BN. Many techniques depend heavily on training data. Ideally, with enough data, it is possible to learn parameters by standard statistical analysis like maximum likelihood (ML) estimation. However, data may be insufficient, leading to inaccurate estimations.

This paper proposes a framework for the parameter learning problem that combines data and domain knowledge in the form of constraints. There are two types of constraints: soft constraints on priors and hard constraints on estimations. Driven by the Imprecise Dirichlet Model [17], prior beliefs are encoded using a set of Dirichlet distributions. Combined with data and constraints on estimations, the result is a set of estimations, on which we apply the maximum entropy principle to obtain a single estimation. Constraints on estimations are viewed as hard constraints and assumed to be correct. Thus only general and certainly valid constraints shall be used. On the other hand, constraints on priors are soft because estimations are adapted and corrected by training data even if some soft constraints are wrongly stated. Altogether, we can encode constraints that are certain as well as less reliable constraints.

There are many approaches to parameter learning of BNs using constraints. For instance, penalty functions can be employed [1], but global optimality is not always guaranteed. Isotonic regression is also applicable, but its complexity is high [7]. Non-convex optimization also leads to high complexities [5]. Closed-form solutions were investigated [9, 10], but they do not allow overlap of constraints (the same parameters may not appear in different constraints). Because of that, many types of constraints cannot be represented. The framework presented here tries to overcome such limitations.

2  Problem  definition 

A Bayesian network is a triple $(G, \mathcal{X}, \mathcal{P})$, where G is a directed acyclic graph with n nodes associated to discrete random variables $\mathcal{X}$ (a variable per node), and $\mathcal{P}$ is a collection of parameters $\theta_{ijk} = p(x_i^k \mid pa_i^j)$, with $\sum_k \theta_{ijk} = 1$, where $x_i^k$ is a value or state of $X_i$ and $pa_i^j$ a complete instantiation for the parents $PA_i$ of $X_i$ in G (it represents a set of states for $PA_i$). In a BN every variable is conditionally independent of its non-descendants given its parents (Markov condition). Thus the joint probability distribution is obtained by $p(X_1, \ldots, X_n) = \prod_i p(X_i \mid PA_i)$. We focus on parameter learning in a BN whose structure is known in advance. Given a data set D where each element is a sample of the BN variables, the goal of parameter learning is to find the most probable values for the vector $\theta$, which can be quantified by the log likelihood function $L_D(\theta) = \log p(D \mid \theta)$. Assuming that samples are drawn independently from the underlying distribution, we need to maximize $L_D(\theta) = \log \prod_{ijk} \theta_{ijk}^{n_{ijk}}$, where $n_{ijk}$ indicates how many elements of D contain both $x_i^k$ and $pa_i^j$. Maximum likelihood estimation has its optimum at $\theta_{ijk} = n_{ijk} / n_{ij}$, where $n_{ij} = \sum_k n_{ijk}$.

Another usual parameter learning technique is the Dirichlet model, where one starts by assuming that an expert has specified a prior BN, denoted by $\mathcal{N}_p$, that conveys her prior beliefs. The goal is to learn the parameters of multinomial distributions on $\theta_{ij}$ using both $\mathcal{N}_p$ and data. The Dirichlet distribution is a natural parametric model for $p(\theta_{ij})$, because it is conjugate with the multinomial distribution. A possible parametrization is

$$p(\theta_{ij}) \propto \prod_k \theta_{ijk}^{s\tau_{ijk} - 1} \quad \text{for } s > 0 \text{ and } \sum_k \tau_{ijk} = 1,$$

where the hyper-parameter s controls dispersion and hyper-parameters $\tau_{ijk}$ control location [17]. The parameter s is often interpreted as the size of a database encoding the same prior beliefs as the Dirichlet distribution. We assume that $\mathcal{N}_p$ is associated with a single positive number s that encodes the quality of the prior BN, and that parameters of $\mathcal{N}_p$ define $\tau$ such that $\tau_{ijk}$ corresponds to $\theta_{ijk}$ in the prior. Then, using expectation as estimator, the optimal estimate $\theta_{ijk}$ is the posterior expected value: $\theta_{ijk} = \frac{s\tau_{ijk} + n_{ijk}}{s + \sum_k n_{ijk}}$.
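The posterior expected value can be computed directly from the counts and the prior. The following sketch handles a single conditional distribution (fixed i and j); the function name `posterior_mean` and the dictionary-based representation are assumptions made for the example.

```python
from typing import Dict, Hashable


def posterior_mean(
    counts: Dict[Hashable, int],
    tau: Dict[Hashable, float],
    s: float,
) -> Dict[Hashable, float]:
    """Posterior expected value (s * tau_k + n_k) / (s + sum_k n_k) for each state k.

    counts[k] plays the role of n_ijk and tau[k] of tau_ijk for fixed (i, j);
    s > 0 is the prior strength. States missing from counts are treated as zero.
    """
    if s <= 0:
        raise ValueError("s must be positive")
    total = sum(counts.get(k, 0) for k in tau)
    return {k: (s * tau[k] + counts.get(k, 0)) / (s + total) for k in tau}
```

With no data the estimate equals the prior location $\tau$, and as the counts grow the estimate moves toward the maximum likelihood value, with s controlling how quickly.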

Standard estimation methods are usually sufficient when enough data are available. However, when only a small amount of data is available, they may produce unreliable estimations. A way to improve estimations is through the use of constraints. Let $\theta_A$ be a sequence of parameters, $\alpha_A$ a corresponding sequence of constant numbers and $\alpha$ also a constant. A linear relationship constraint is defined as

$$\sum_{\theta_{ijk} \in \theta_A,\ \alpha_{ijk} \in \alpha_A} \alpha_{ijk} \cdot \theta_{ijk} \le \alpha,$$  (1)

that is, any linear constraint over parameters can be expressed as a linear relationship constraint. For instance, qualitative influences and synergies [18] can be expressed by linear constraints. Suppose $X_1, X_2, X_3$ are random variables that assume values in $\{x_i^1, x_i^2\}$, with $x_i^2$ greater than $x_i^1$, such that $X_1$ is the child of $X_2$ and $X_3$. It is said that $X_2$ has a positive influence on $X_1$ if $p(x_1^2 \mid x_2^2, x_3^1) \ge p(x_1^2 \mid x_2^1, x_3^1)$ $^1$ and $p(x_1^2 \mid x_2^2, x_3^2) \ge p(x_1^2 \mid x_2^1, x_3^2)$, that is, a greater value of $X_2$ increases the probability of a greater value of $X_1$ (for both values of $X_3$). A positive synergy of two parents in a common child happens when the parents influence the child together, for example, $p(x_1^2 \mid x_2^2, x_3^2) + p(x_1^2 \mid x_2^1, x_3^1) \ge p(x_1^2 \mid x_2^2, x_3^1) + p(x_1^2 \mid x_2^1, x_3^2)$. This means that a greater value of the child is more likely when the parents have the same value. In fact these are just simple (but important) examples of constraints that are allowed. Other examples are sum of parameters, range, relationship, and ratio constraints [9], weak and strong, monotonic and non-monotonic influences and synergies [11], among many others. Our assumption about constraints is even more general: they must define a (possibly non-linear) convex parameter space, that is, any constraint in the form $h(\theta) \le 0$, where h is convex, is allowed. Most of the recent literature on this topic can be expressed by convex constraints [9, 10, 11, 18]. Such flexibility allows for a better description of the knowledge, as we have no restriction regarding the number of times a parameter appears in constraints or whether constraints involve distinct distributions of the BN.

$^1$ We use a probability notation for ease of exposition, as each parameter $\theta_{ijk}$ is a probability value of the network.
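The influence and synergy conditions are plain linear inequalities over entries of one conditional probability table, so they are easy to check. The sketch below assumes a hypothetical encoding in which `p_high[(x2, x3)]` holds $p(x_1^2 \mid x_2, x_3)$ with parent states coded as 1 (lower value) and 2 (greater value); the function names are introduced here for illustration.

```python
from typing import Dict, Tuple

# p_high[(x2, x3)] = p(X1 takes its greater value | X2 = x2, X3 = x3),
# with parent states coded 1 (lower) and 2 (greater). Hypothetical encoding.
CPT = Dict[Tuple[int, int], float]


def has_positive_influence(p_high: CPT) -> bool:
    """X2 positively influences X1: raising X2 never lowers p_high, for both values of X3."""
    return all(p_high[(2, x3)] >= p_high[(1, x3)] for x3 in (1, 2))


def has_positive_synergy(p_high: CPT) -> bool:
    """Parents with equal values jointly favour the greater child value."""
    return p_high[(2, 2)] + p_high[(1, 1)] >= p_high[(2, 1)] + p_high[(1, 2)]
```

Because each condition is linear in the parameters, the set of tables satisfying it is convex, which is what the framework requires of a constraint.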


3  Parameter learning using convex optimization

If we only have constraints on $\theta$, ML can be solved by convex programming, as a maximization of a concave log-likelihood function must be solved. There are optimization algorithms that solve convex programs in polynomial time [2, 4]. They can be as fast as linear programming solvers. Convex programming has the attractive property that any local optimum is also a global optimum [4]. This idea of constrained ML is rather intuitive in its interpretation and in the method to solve it. However, ML does not allow us to define prior assessments (probabilities cannot be directly taken as constraints because if they are fixed at this stage, the empirical data cannot change them). The interpretation of constraints only as hard constraints on estimations can be too inflexible, and in some cases it is more profitable to interpret the expert’s belief as a partial assessment of a prior distribution.
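The constrained ML problem described in this paragraph can be written out explicitly. The following is a sketch in the notation of Section 2, assuming the expert's knowledge is given as convex functions $h_c$; the index $c$ over constraints is introduced here only for illustration.

```latex
\begin{aligned}
\max_{\theta} \quad & \sum_{i,j,k} n_{ijk} \log \theta_{ijk} \\
\text{subject to} \quad & h_c(\theta) \le 0 \quad \text{for each constraint } c, \\
& \sum_{k} \theta_{ijk} = 1 \quad \text{for all } i, j, \\
& \theta_{ijk} \ge 0 \quad \text{for all } i, j, k.
\end{aligned}
```

The objective is concave because each $\log \theta_{ijk}$ is concave and the counts $n_{ijk}$ are nonnegative, and the feasible region is an intersection of convex sets, so any local optimum of this program is also global.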

Besides constraints on parameters $\theta_{ijk}$, we allow constraints on hyper-parameters $\tau_{ijk}$. The idea is to work with both constraints on $\theta$ and constraints on $\tau$. So, assume that an expert has specified two sets of constraints, denoted by $C_p$ and $C$, that convey her prior beliefs and some knowledge about parameters, respectively. The content of $C_p$ is viewed as the set of constraints on the hyper-parameters $\tau_{ijk}$ of Dirichlet distributions for a fixed value of s, while $C$ is the set of hard constraints on estimations. Constraints in $C_p$ are constraints on the prior, not on the probability values themselves. The only assumption for constraints in $C_p$ is that they must be convex constraints on $\tau$ (this is the same assumption as for $C$). If the expert is certain, she defines a constraint over $\theta$. Otherwise, a similar constraint, but now over $\tau$, may be used. In summary, constraints of $C_p$ and $C$ can be specified similarly. The former defines restrictions on parameters $\tau$, while the latter defines restrictions for $\theta$. This formulation is based on the Imprecise Dirichlet Model [17], which has received great attention recently [5, 14].

The result is a set of distributions that must satisfy
$C(\theta)$, $C_p(\tau)$ and equations

$$\theta_{ijk} = \frac{s\,\tau_{ijk} + n_{ijk}}{s + \sum_k n_{ijk}}, \qquad (2)$$

where $n_{ijk}$ are the counts from the data set (simplex
constraints $\forall_{ij}\ \sum_k \theta_{ijk} = 1$ and
$\sum_k \tau_{ijk} = 1$ are assumed to be in $C(\theta)$ and
$C_p(\tau)$). Equation (2) is the glue between $C(\theta)$
and $C_p(\tau)$. This formulation has several attractive
features. First, it deals with qualitative and numerical
aspects in a uniform manner. Second, it uses constraints on
priors and on estimations, making it possible for the expert
to state both hard and soft constraints. As any convex
constraint is allowed, it is possible to specify precise
probability measures as well as vacuous beliefs (the
specification of a single prior is also possible as it is a
sub-case). Third, a single hyper-parameter $s$ must be
elicited to capture the quality of the prior. Fourth,
computations are efficient as all constraints are convex.
Still, all these constraints define a set of distributions.
We then employ the maximum entropy principle, locally
applied to each conditional distribution, to select one
distribution from this set. Distributions of maximum entropy
are conservative and tend to agree with frequencies [16].
So, the framework can be summarized as the following
optimization problem:


$$\max \sum_{ijk} -\theta_{ijk} \log \theta_{ijk}, \qquad (3)$$


subject to Equations (2), convex constraints $C_p(\tau)$ and
convex constraints $C(\theta)$. This formulation can be
solved in polynomial time by convex programming, as Equation
(3) is the maximization of a concave function [8] subject
to convex constraints. The resulting $\theta$ is our estimation.
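As a minimal sketch of how Equations (2) and (3) interact, the following Python code treats a single binary conditional distribution whose only prior constraint is an interval on $\tau_1$. The function names, the counts, the value of $s$ and the interval are illustrative assumptions, and a plain grid search stands in for the convex programming solver that the text refers to.

```python
import math

def idm_estimate(counts, tau, s):
    """Equation (2): theta_k = (s * tau_k + n_k) / (s + sum_k n_k)."""
    total = s + sum(counts)
    return [(s * t + n) / total for t, n in zip(tau, counts)]

def entropy(theta):
    """Entropy -sum_k theta_k log theta_k, the objective of Equation (3)."""
    return -sum(p * math.log(p) for p in theta if p > 0.0)

def max_entropy_binary(counts, s, tau_lo, tau_hi, steps=1000):
    """Grid search over tau_1 in [tau_lo, tau_hi] for a binary variable.

    Returns the tau_1 with the largest entropy and the induced theta.
    This brute-force search is an assumption made for illustration only.
    """
    best = None
    for i in range(steps + 1):
        t1 = tau_lo + (tau_hi - tau_lo) * i / steps
        theta = idm_estimate(counts, [t1, 1.0 - t1], s)
        h = entropy(theta)
        if best is None or h > best[0]:
            best = (h, t1, theta)
    return best[1], best[2]
```

With counts (3, 1), s = 2 and the assumed constraint 0.2 <= tau_1 <= 0.6, the search picks the prior that brings the estimate as close to uniform as the data and the constraint allow.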

Finally, we point out that Equations (2) pull the values
of $\theta_{ijk}$ toward $n_{ijk} / \sum_k n_{ijk}$ (the
frequencies of $\theta_{ijk}$ in the data set), while
constraints on $\theta_{ijk}$ defined by the expert might
(or might not) impose a different value. As we assume that
hard constraints specified by the expert are correct, a
penalty optimization variable is introduced in Equation (2)
to guarantee that preference is given to the hard
constraints defined by the expert and that the problem will
not be infeasible (as long as the expert's constraints are
not already infeasible, in which case the expert should
update her beliefs).


4  Experiments 


We perform experiments using data sets with 10, 100
and 1000 samples. Three well-known networks are
evaluated: Asia (Lauritzen and Spiegelhalter), Alarm
(Beinlich et al.), and Insurance (Binder et al.).
For each network, we take one given parametrization
as our true model and generate samples from it. Then the
Kullback-Leibler (KL) divergence is computed to measure
the difference between the distributions of estimated
networks and the distributions of true networks. Random
linear (convex) constraints are generated, each involving
from 1 to 5 parameters. The constraints are created over the
true network (so they are certainly correct), in number
equal to the number of conditional distributions in the
corresponding network. For each configuration, we work
with 30 random sets of data and constraints. Averages
of KL divergence are presented in Table 1. Columns
have results of standard ML, Constrained ML (CML) and
Constrained Maximum Entropy (CME).

Results indicate a strong decrease in the divergence
when working with constraints (second and third
columns of each block). Moreover, benefits are more
significant with larger networks. We note that results
with constraints using only 10 samples are better than
results using 100 or even 1000 samples without
constraints, which indicates a relevant decrease in
the amount of data that would be necessary for achieving
the same accuracy. The third column of each block
shows results for the combination of constraints on
priors and estimations using the maximum entropy idea.
It is interesting to note the advantages when compared
to the second column, because CML already uses
constraints on estimations.

We also consider the problem of recognizing facial
action units from real image data. Based on the Facial
Action Coding System [6], facial behaviors can be
decomposed into a set of Action Units (AUs). We work
with a BN with 28 nodes to recognize 14 commonly
occurring AUs (there are a hidden node and a measurement
node for each AU). The structure of the BN is learned as
described in Tong et al. [13]. We define 42 simple linear
constraints, mainly describing influences among AUs.
The 8000 images from Cohn and Kanade's DFAT-504
database are used. Testing is performed over 20% of
the data (not chosen for training). We consider training
data sets with 100 and 1000 samples (as constraints are
more relevant when insufficient data are available),
chosen randomly from the training database. Results are
shown in Table 2. CME obtains an overall recognition
rate (percentage of correctly classified cases) of 93.1%,
which is similar to current state-of-the-art results. For
instance, Tong et al. [13] report 93.3%, Bartlett et al.
[3] report 93.6%, and other methods have results with
slight variance [12, 15].
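The accuracy measure used in the synthetic experiments, the KL divergence from the true distribution to the estimated one, can be sketched as follows for two discrete distributions listed explicitly. This is only an illustration: the function name and the dictionary representation are assumptions, and for a Bayesian network the two arguments would be the joint distributions induced by the true and the estimated parameters.

```python
import math

def kl_divergence(p_true, p_est):
    """KL(p_true || p_est) for distributions given as dicts over outcomes."""
    total = 0.0
    for outcome, p in p_true.items():
        if p == 0.0:
            continue  # a term 0 * log(0 / q) contributes nothing
        q = p_est.get(outcome, 0.0)
        if q == 0.0:
            return math.inf  # estimate gives zero mass to a possible outcome
        total += p * math.log(p / q)
    return total
```

The divergence is zero only when the two distributions coincide, so smaller averages in Table 1 mean estimates closer to the true model.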


5  Conclusion 

This  paper  presents  a  framework  for  parameter  learn¬ 
ing  when  domain  knowledge  is  available  in  the  form 


Network 

Nodes 

Distr 

Dimension 

10  samples 

ML  CML  CME 

100  samples 

ML  CML  CME 

1000  samples 

ML  CML  CME 

Asia 

9 

21 

21 

1.22 

0.25 

0.27 

0.14 

0.11 

0.05 

0.03 

0.03 

Alarm 

37 

243 

509 

3.26 

0.61 

2.51 

1.19 

0.47 

1.12 

0.58 

0.26 

Insurance 

27 

411 

1008 

3.62 

1.56 

0.63 

2.24 

0.92 

0.44 

0.96 

0.40 

0.20 

Table  1.  Average  KL  divergence  using  random  samples  and  constraints. 


Rate        100 samples        1000 samples
            ML       CME       ML       CME
Positive    65.6%    75.7%     73.2%    83.9%
Negative    95.8%    97.7%     95.9%    97.0%

Table 2. Positive and negative rates for AU recognition.


of convex constraints. We have introduced a new idea
based on the Imprecise Dirichlet Model and the maximum
entropy criterion that is able to deal with constraints
on priors and on estimations. The framework is fast and
is guaranteed to find the globally optimal solution.
Through experiments with well-known networks, we show
that both constraints on priors and estimations are
important to improve parameter learning accuracy.

The main contribution of this work is to allow an
expert to specify her knowledge using hard constraints
on estimations and soft constraints on priors, with no
restrictions on the format of constraints besides
convexity. As far as we know, no previous method was
able to handle such a general situation. The idea can
also be embedded into an iterative procedure to treat
incomplete data, similar to the Expectation-Maximization
(EM) method. This discussion is left for the future, but
we anticipate that the benefits are similar to those for
complete data. We believe that the application of these
ideas to real domains is promising and we intend to pursue
that in future work, as well as an investigation of the
effect of wrong constraints.
Abstract

This paper describes a new algorithm to solve the decision
making problem in Influence Diagrams based on algorithms
for credal networks. Decision nodes are associated with
imprecise probability distributions and a reformulation is
introduced that finds the global maximum strategy with
respect to the expected utility. We work with Limited
Memory Influence Diagrams, which generalize most Influence
Diagram proposals and handle simultaneous decisions.
Besides the global optimum method, we explore an anytime
approximate solution with a guaranteed maximum error and
show that imprecise probabilities are handled in a
straightforward way. Complexity issues and experiments
with random diagrams and an effects-based military
planning problem are discussed.

1  INTRODUCTION 

An influence diagram is a graphical model for decision
making under uncertainty [13]. It is composed of a
directed graph where utility nodes are associated with
profits and costs of actions, chance nodes represent
uncertainties and dependencies in the domain, and decision
nodes represent actions to be taken. Given an influence
diagram, a strategy defines which decision to take at each
node, given the information available at that moment. Each
strategy has a corresponding expected utility. One of the
most important problems in influence diagrams is strategy
selection, where we need to find the strategy with maximum
expected utility. A simple approach is to evaluate each
possible strategy and compare their expected utilities.
However, the number of strategies grows exponentially in
the number of decisions to be taken.

In this paper, we propose a new idea to find the best
strategy based on a reformulation of the problem as
an inference in a credal network [4]. We show through
experiments that this approach can handle small and
medium diagrams exactly, and provides an anytime
approximation in case we stop the process early. Our
idea works with a very general class of influence
diagrams, named Limited Memory Influence Diagrams
(LIMIDs) [15]. Limited Memory means that the assumption
of no-forgetting usually employed in Influence Diagrams
(that is, values of observed variables and decisions that
have been taken are remembered at all later times) is
relaxed. This class of diagrams is interesting because
most other influence diagram proposals can be efficiently
converted into LIMIDs.

To solve strategy selection, many approaches work on
special cases of influence diagrams, exploiting their
characteristics to improve performance. In many cases,
it is assumed that there is an ordering in which the
decisions are to be taken and that the no-forgetting rule
holds, so that previous decisions are known at the moment
of the current decision [14, 18, 19, 20, 21]. The ordering
of decision nodes is exploited to evaluate the optimal
strategy. There are also proposals in the class of
simultaneous influence diagrams, where decisions are
assumed to have no antecedents. This assumption reduces
the number of possible strategies and allows for
factorization ideas [22]. LIMIDs make no assumptions about
no-forgetting or the ordering of decisions, even though it
is possible to convert diagrams that have such assumptions
into LIMIDs.

In order to test our method, we generate a data set
of random influence diagrams. Empirical results indicate
that the accuracy of our method is better than that of
other approaches. We also apply our idea to solve an
Effects-based operations (EBO) military planning problem.
The EBO approach pursues a campaign objective by
considering direct, indirect and cascading effects of
military, diplomatic, psychological and economic actions
[6, 11]. We use an influence diagram to model a
hypothetical EBO problem.


Section  2  introduces  our  notation  for  influence  dia¬ 
grams  and  the  problem  of  strategy  selection.  Section  3 
describes  the  framework  of  credal  networks  and  the  in¬ 
ference  problem  on  such  networks.  Section  4  presents 
how  we  solve  strategy  selection  through  a  reformula¬ 
tion  of  the  problem  as  an  inference  in  credal  networks. 
Section  5  presents  some  experiments,  including  the 
EBO  military  planning  problem,  and  finally  Section 
6  concludes  the  paper  and  indicates  future  work. 

2  INFLUENCE  DIAGRAMS 

A Limited Memory Influence Diagram $\mathcal{I}$ is composed
of a directed acyclic graph $(V, E)$ where nodes are
partitioned into three types: chance, decision and utility
nodes. Let $\mathcal{C}$, $\mathcal{D}$ and $\mathcal{U}$ be
the sets of chance, decision and utility nodes,
respectively, and let $\mathcal{X} = \mathcal{C} \cup \mathcal{D}$.
Links of $E$ characterize dependencies among nodes.
Explicitly, links toward a chance node indicate
probabilistic dependence of the node on its parents; links
toward a decision node indicate which information is
available when that decision is taken; and links toward a
utility node indicate that a utility for those parents is
to be considered (utility nodes may not have children).
Associated with each node, there are some parameters:

1. A chance node has an associated categorical random
variable $C$ with finite domain $\Omega_C$ and conditional
probability distributions $p(C \mid \pi_j(C))$, for
each configuration $\pi_j(C)$ of its parents $\pi(C)$ in
the graph. $j$ is used to indicate a configuration of
the parents of $C$, that is, $\pi_j(C) \in \Omega_{\pi(C)}$,
where the notation $\Omega_{V'} = \times_{Y \in V'} \Omega_Y$
is used for any $V' \subseteq V$.

2. A decision node $D$ is associated with a finite set of
mutually exclusive alternatives $\Omega_D$. Parents of $D$
describe the information that is available at the
moment at which decision $D$ has to be taken.

3. A utility node $U$ is associated with a rational-valued
function $f_U : \Omega_{\pi(U)} \to \mathbb{Q}$. The value
corresponding to a parent configuration is the profit (cost
is viewed as negative profit) of such parent configuration.
Utility nodes have no children.

A simple example is depicted in Figure 1. Decision nodes
are represented by rectangles, chance nodes by ellipses
and utility nodes by diamonds. do-ground-attack has an
associated cost, which is depicted by the corresponding
utility node. The same is modeled for bomb-bridge. The
goal is to achieve territory-occupation, which also has a
utility (the profit of the goal). ground-attack and
bridge-condition represent the uncertain outcomes of the
corresponding actions. Note that there is no known
ordering in which decisions must be taken. Although
decision nodes have no parents in this example, there is
no such restriction.

Figure 1: Simple Influence Diagram example.

A policy $\delta_D$ for the decision node $D$ is a function
$\delta_D : \Omega_{D \cup \pi(D)} \to [0, 1]$ defined for
each alternative of $D$ and each configuration of $\pi(D)$
such that, for each $\pi_j(D) \in \Omega_{\pi(D)}$, we have
$\sum_{d \in \Omega_D} \delta_D(d, \pi_j(D)) = 1$.

A pure policy is a policy whose image is integer
($\delta_D : \Omega_{D \cup \pi(D)} \to \{0, 1\}$), and thus
specifies with certainty which action (alternative of $D$)
is taken for each parent configuration (in a pure policy,
only one $\delta_D(d, \pi_j(D))$ for each $\pi_j(D)$ is
non-zero, as they sum to 1). A strategy $\Delta$ is a set of
policies $\{\delta_D : D \in \mathcal{D}\}$, one for each
decision node of the diagram. A pure strategy is composed
only of pure policies.

The expected utility $EU(\Delta)$ of a strategy $\Delta$ is
evaluated through the following equation:

$$EU(\Delta) = \sum_{x \in \Omega_{\mathcal{X}}} \left( \prod_{C} p(x_C \mid \pi_j(C)) \prod_{D} \delta_D(x_D) \sum_{U} f_U(\pi_{j'}(U)) \right), \qquad (1)$$

where $x_C$, $\pi_j(C)$, $x_D$ and $\pi_{j'}(U)$ are
respectively the projections of $x$ in $\Omega_C$,
$\Omega_{\pi(C)}$, $\Omega_{D \cup \pi(D)}$ and $\Omega_{\pi(U)}$.

This equation means that, given a strategy, its expected
utility is the sum of the utility values weighted by the
probability of each diagram configuration (for all
configurations). The maximum expected utility is
obtained over all possible strategies:

$$MEU = \max_{\Delta} EU(\Delta).$$

The problem of strategy selection is to obtain the
strategy that maximizes the expected utility, that is,
$\arg\max_{\Delta} EU(\Delta)$.
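To make Equation (1) concrete, the following Python sketch evaluates the expected utility of a strategy by direct enumeration in a very small diagram. The diagram (one decision D with no parents, one chance node C with parent D, a cost utility on D and a goal utility on C), its numbers and all identifiers are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical toy diagram: D -> C, utility F_COST on D, utility F_GOAL on C.
P_C_GIVEN_D = {
    ("attack", "success"): 0.7, ("attack", "failure"): 0.3,
    ("wait", "success"): 0.1, ("wait", "failure"): 0.9,
}
F_COST = {"attack": -20.0, "wait": 0.0}
F_GOAL = {"success": 100.0, "failure": 0.0}

def expected_utility(policy):
    """Equation (1) for the toy diagram; policy[d] plays the role of delta_D(d)."""
    eu = 0.0
    for d, c in product(F_COST, F_GOAL):
        weight = P_C_GIVEN_D[(d, c)] * policy[d]   # probability of configuration x
        eu += weight * (F_COST[d] + F_GOAL[c])     # sum of utilities at x
    return eu

# Strategy selection by brute force over the two pure policies of D.
pure_policies = [{"attack": 1.0, "wait": 0.0}, {"attack": 0.0, "wait": 1.0}]
meu, best_policy = max(
    ((expected_utility(pi), pi) for pi in pure_policies),
    key=lambda pair: pair[0],
)
```

Enumerating pure policies is feasible only because the example has a single decision with two alternatives; the number of pure strategies grows exponentially with the number of decisions.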

3  CREDAL  NETWORKS 

We need some concepts of credal networks before
presenting the reformulation to solve strategy selection.
A convex set of probability distributions is called a
credal set [4]. A credal set for $X$ is denoted by $K(X)$;
we assume that every random variable is categorical and
that every credal set has a finite number of vertices.
Given a credal set $K(X)$ and an event $A$, the upper and
lower probability of $A$ are respectively
$\max_{p(X) \in K(X)} p(A)$ and $\min_{p(X) \in K(X)} p(A)$.
A conditional credal set is a set of conditional
distributions, obtained by applying Bayes rule to each
distribution in a credal set of joint distributions.
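Because the probability of an event is linear in the distribution, its maximum and minimum over a credal set with finitely many vertices are attained at vertices. The sketch below computes lower and upper probabilities that way; the vertex list and the function names are illustrative assumptions.

```python
def event_probability(dist, event):
    """Probability of an event (a set of outcomes) under one distribution."""
    return sum(p for outcome, p in dist.items() if outcome in event)

def lower_upper_probability(vertices, event):
    """Lower and upper probability of an event over the vertices of a credal set."""
    values = [event_probability(v, event) for v in vertices]
    return min(values), max(values)

# Hypothetical credal set over three categories, given by its vertices.
vertices = [
    {"low": 0.2, "medium": 0.5, "high": 0.3},
    {"low": 0.4, "medium": 0.4, "high": 0.2},
    {"low": 0.1, "medium": 0.3, "high": 0.6},
]
lower, upper = lower_upper_probability(vertices, {"medium", "high"})
```

For the event {medium, high} the three vertices give 0.8, 0.6 and 0.9, so the bounds are 0.6 and 0.9 up to floating-point rounding.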

A (separately specified) credal network
$\mathcal{N} = (G, \mathbf{X}, \mathbf{K})$ is composed of a
directed acyclic graph $G = (V, E)$ where each node of $V$
is associated with a random variable $X_i \in \mathbf{X}$
and with a collection of conditional credal sets
$K(X_i \mid \pi(X_i)) \in \mathbf{K}$, where $\pi(X_i)$
denotes the parents of $X_i$ in the graph. Note that we
have a conditional credal set related to $X_i$ for each
configuration $\pi_j(X_i) \in \Omega_{\pi(X_i)}$. A root
node is associated with a single marginal credal set. We
assume that in a credal network every random variable is
independent of its non-descendants non-parents given its
parents; this is the Markov condition on the network. In
this paper we adopt the concept of strong independence¹:
two random variables $X_i$ and $X_j$ are strongly
independent when every extreme point of $K(X_i, X_j)$
satisfies standard stochastic independence of $X_i$ and
$X_j$ (that is, $p(X_i \mid X_j) = p(X_i)$ and
$p(X_j \mid X_i) = p(X_j)$) [4]. Strong independence is the
most commonly adopted concept of independence for credal
sets, probably due to its connection with standard
stochastic independence.

Given a credal network, its extension is any joint credal
set that satisfies all constraints encoded in the network.
The strong extension $\mathcal{K}$ of a credal network is
the largest joint credal set such that every variable is
strongly independent of its non-descendants non-parents
given its parents. The strong extension of a credal network
is the joint credal set that contains every possible
combination of vertices for all credal sets in the network
[5]; that is, each vertex of a strong extension factorizes
as follows:

$$p(X_1, \ldots, X_n) = \prod_i p(X_i \mid \pi(X_i)). \qquad (2)$$

Thus, a credal network can be viewed as a representation
for a set of Bayesian networks with distinct parameters but
sharing the same graph.

3.1  INFERENCE 

A marginal inference in a credal network is the
computation of upper (or lower) probabilities in an
extension of the network. If $X_q$ is a query variable,
then a marginal inference is the computation of tight
bounds for $p(x_q)$ for one or more categories $x_q$ of
$X_q$. For inferences in strong extensions, it is known
that distributions that maximize $p(x_q)$ belong to the set
of vertices of the extension [12]. So, an inference can be
produced by combinatorial optimization, as we must find a
vertex for each local credal set $K(X_i \mid \pi(X_i))$ so
that Expression (2) leads to a maximum of $p(x_q)$. In
general, inference offers tremendous computational
challenges, and exact inference algorithms based on
enumeration of all potential vertices face serious
difficulties [4].

¹ We note that other concepts of independence are found
in the literature [3, 10].
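The vertex-enumeration view of inference can be illustrated on a two-node credal network X -> Y with binary variables. The sketch below selects one vertex from each local credal set, applies the factorization of Expression (2), and keeps the extreme values of p(y). The credal sets and names are hypothetical, and the approach is the exhaustive enumeration that the text describes as impractical for larger networks.

```python
from itertools import product

# Vertices of each local credal set, stored as P(True | parent configuration).
K_X = [0.3, 0.5]              # K(X): candidate values for p(x)
K_Y_GIVEN_X = [0.6, 0.9]      # K(Y | x): candidate values for p(y | x)
K_Y_GIVEN_NOT_X = [0.1, 0.2]  # K(Y | not x): candidate values for p(y | not x)

def marginal_y(p_x, p_y_x, p_y_nx):
    """p(y) for one combination of vertices, using p(x, y) = p(x) p(y | x)."""
    return p_x * p_y_x + (1.0 - p_x) * p_y_nx

def bounds_on_y():
    """Lower and upper probability of y by enumerating all vertex combinations."""
    values = [
        marginal_y(*combo)
        for combo in product(K_X, K_Y_GIVEN_X, K_Y_GIVEN_NOT_X)
    ]
    return min(values), max(values)
```

The number of combinations is the product of the numbers of vertices of all local credal sets, which is why enumeration does not scale.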

A different way to solve the problem is to recognize
that an upper (or lower) value for $p(x_q)$ may be
obtained by the optimization of a multilinear polynomial
over probability values, subject to constraints. This
idea is discussed in the literature and different methods
to reformulate the inference problem were proposed
[7, 9]. Empirical results suggest that this is the most
effective way for exact inferences. In the next section,
we describe an idea based on bilinear programming
[9] to perform inferences in credal networks and show
how it can be employed to solve the strategy selection
problem of influence diagrams.


4  STRATEGY  SELECTION  AS  A 
CREDAL  NET  INFERENCE 


Suppose we want to find the strategy $\Delta_{opt}$ that
maximizes the expected utility in an influence diagram
$\mathcal{I}$, that is, $\Delta_{opt} = \arg\max_{\Delta} EU(\Delta)$.
Let $\underline{f}$ and $\overline{f}$ be the minimum and
maximum utility values specified in the diagram over all
utility nodes and parent configurations, that is,

$$\underline{f} = \min_{U, \pi_j(U)} f_U(\pi_j(U)), \qquad \overline{f} = \max_{U, \pi_j(U)} f_U(\pi_j(U)).$$

We create an identical influence diagram $\mathcal{I}'$
except that the utility function $f'_U$ (for each node $U$)
is defined as

$$f'_U(\pi_j(U)) = \frac{f_U(\pi_j(U)) - \underline{f}}{\overline{f} - \underline{f}}.$$

The denominator is positive because $\underline{f} < \overline{f}$
(if $\underline{f} = \overline{f}$, then the influence
diagram is trivial as all utility values are equal). We
note that this transformation is similar to that proposed
by Cooper [2]. It is not hard to see that
$\arg\max_{\Delta} EU(\Delta) = \arg\max_{\Delta} EU'(\Delta)$
(just take the constant terms out of the summations in
Equation (1)), and

$$\max_{\Delta} EU'(\Delta) = \frac{\max_{\Delta} EU(\Delta) - |\mathcal{U}|\,\underline{f}}{\overline{f} - \underline{f}}.$$
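The rescaling of utilities and its effect on the expected utility can be sketched as follows. The function names and the example utility tables are assumptions made for illustration; the arithmetic is the transformation stated in the text, with every utility value mapped to [0, 1] by subtracting the global minimum and dividing by the global range.

```python
def normalize_utilities(utility_tables):
    """Map every utility value to [0, 1] using the global minimum and maximum."""
    values = [v for table in utility_tables for v in table.values()]
    f_min, f_max = min(values), max(values)
    if f_min == f_max:
        raise ValueError("trivial diagram: all utility values are equal")
    span = f_max - f_min
    scaled = [
        {key: (v - f_min) / span for key, v in table.items()}
        for table in utility_tables
    ]
    return scaled, f_min, f_max

def rescale_expected_utility(eu, n_utility_nodes, f_min, f_max):
    """EU' = (EU - |U| * f_min) / (f_max - f_min) for any fixed strategy."""
    return (eu - n_utility_nodes * f_min) / (f_max - f_min)
```

For example, with two utility nodes, a minimum of -20 and a maximum of 100, an expected utility of 50 in the original diagram corresponds to 0.75 in the rescaled one. Since the map is increasing in EU, the maximizing strategy is unchanged.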


This implies that strategy selection in $\mathcal{I}$ is
the same as strategy selection in $\mathcal{I}'$. Now, we
translate the selection problem of $\mathcal{I}'$ to a
credal network inference. Suppose we define a credal
network with a graph similar to that of $\mathcal{I}'$
such that:


•  Chance nodes are directly translated as nodes of
the credal network (parents are the same as in $\mathcal{I}'$).

•  Utility nodes are translated to binary random nodes.
Let $U$ be a utility node with function $f'_U$. In the
credal network, $U$ becomes a binary node (with the same
parents as before) with categories $u$ and $\neg u$ such
that $p(u \mid \pi_j(U)) = f'_U(\pi_j(U))$ and
$p(\neg u \mid \pi_j(U)) = 1 - p(u \mid \pi_j(U))$ [2].

•  Decision nodes are translated to probabilistic
nodes with imprecise distributions such that policies
become probability distributions (in fact, according to
our definition of policy, they are already non-negative
and sum to 1). Thus,
$p(d \mid \pi_j(D)) = \delta_D(d, \pi_j(D))$ for all $d$
and $\pi_j(D)$. Note that $p(D \mid \pi_j(D))$, for each
$\pi_j(D)$, is a distribution with unknown probability
values (this interpretation of decision nodes as imprecise
probability nodes is discussed by Antonucci and Zaffalon,
see e.g. [1]).

Using this credal network formulation, the expected
utility of a strategy $\Delta$ can be written as

$$EU'(\Delta) = \sum_{x \in \Omega_{\mathcal{X}}} \left( \prod_{X} p_{\Delta}(x \mid \pi_j(X)) \sum_{U} p(u \mid \pi_{j'}(U)) \right),$$

where $x$, $\pi_j(X)$ and $\pi_{j'}(U)$ are projections of
$x$ into the corresponding domains, $X$ ranges over all
nodes corresponding to chance and decision nodes of the
influence diagram, and $p_{\Delta}$ represents the
distribution induced by the strategy $\Delta$, that is,
when the strategy is chosen, $p_{\Delta}$ is a known
probability distribution.

With some simple manipulations, we have:

$$EU'(\Delta) = \sum_{x \in \Omega_{\mathcal{X}}} \left( p_{\Delta}(x) \sum_{U} p(u \mid \pi_{j'}(U)) \right),$$

$$EU'(\Delta) = \sum_{x \in \Omega_{\mathcal{X}}} \left( \sum_{U} p(u \mid \pi_{j'}(U))\, p_{\Delta}(x) \right),$$

$$EU'(\Delta) = \sum_{U} \sum_{x \in \Omega_{\mathcal{X}}} p_{\Delta}(u, x) = \sum_{U} p_{\Delta}(u),$$

and then

$$MEU' = \max_{\Delta} \sum_{U} p_{\Delta}(u) = \max_{p \in \mathcal{K}} \sum_{U} p(u),$$

where $p \in \mathcal{K}$ means that we select a
distribution $p$ in the extension of the credal network.
In fact the only places where $p$ may vary are related to
the imprecise probabilities of the former decision nodes.
When we select $p$, we get a precise distribution that has
a corresponding strategy $\Delta$. So, we have a credal
network and need to find a distribution $p$ that maximizes
the sum of marginal probabilities of the $U$ nodes.
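The equivalence between maximizing expected utility and maximizing the sum of the marginals p(u) can be checked numerically on a small example. The sketch below uses a hypothetical diagram (decision D, chance node C with parent D, a cost utility on D and a goal utility on C); all numbers and identifiers are assumptions, and the vertices of the credal set of the decision node are simply its two pure policies.

```python
# Hypothetical toy diagram: D -> C, cost utility on D, goal utility on C.
P_SUCCESS_GIVEN_D = {"attack": 0.7, "wait": 0.1}   # p(success | d)
F_COST = {"attack": -20.0, "wait": 0.0}
F_GOAL = {"success": 100.0, "failure": 0.0}

ALL_VALUES = list(F_COST.values()) + list(F_GOAL.values())
F_MIN, F_MAX = min(ALL_VALUES), max(ALL_VALUES)
SPAN = F_MAX - F_MIN

# Utility nodes become binary nodes whose p(u | parents) is the rescaled utility.
P_U_COST = {d: (v - F_MIN) / SPAN for d, v in F_COST.items()}
P_U_GOAL = {c: (v - F_MIN) / SPAN for c, v in F_GOAL.items()}

def sum_of_marginals(d):
    """p(u_cost) + p(u_goal) when the decision node puts all mass on d."""
    p_success = P_SUCCESS_GIVEN_D[d]
    p_u_goal = (p_success * P_U_GOAL["success"]
                + (1.0 - p_success) * P_U_GOAL["failure"])
    return P_U_COST[d] + p_u_goal

best = max(F_COST, key=sum_of_marginals)   # best vertex of the decision credal set
meu_scaled = sum_of_marginals(best)        # MEU' of the rescaled diagram
meu = meu_scaled * SPAN + 2 * F_MIN        # undo the rescaling (two utility nodes)
```

In this example the maximizing alternative is "attack", the rescaled value is 0.75, and undoing the rescaling recovers a maximum expected utility of 50 in the original units.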


4.1  INFERENCE  AS  AN  OPTIMIZATION 
PROBLEM 

The sum of marginal inferences in the credal network
can be formulated as a multilinear programming problem.
The goal is to maximize the expression

$$\sum_{U} p(u) = \sum_{U} \sum_{x \in \Omega_{\mathcal{X}}} \left( p(u \mid \pi_{j'}(U)) \prod_{X} p(x \mid \pi_j(X)) \right), \qquad (3)$$

where $x$, $\pi_{j'}(U)$ and $\pi_j(X)$ are the projections
of $x$ in the corresponding domains, and where some
distributions $p(X \mid \pi_j(X))$ are precisely known and
others are imprecise. In this formulation we must deal with
a large number of multilinear terms. To avoid them, we
briefly describe the bilinear transformation procedure
proposed by de Campos and Cozman [9] to replace the large
Expression (3) by simple bilinear expressions. We refer to
[9] for additional details.

The  idea  is  based  on  a  precedence  ordering  of  the  net¬ 
work  variables,  which  is  an  ordering  where  all  ances¬ 
tors  of  a  given  variable  in  the  network’s  graph  appear 
before  it  in  the  ordering.  The  bilinear  transformation 
algorithm  processes  the  network  variables  top-down: 
at  each  step  some  constraints  are  generated  that  de¬ 
fine  the  relationship  between  the  query  and  the  cur¬ 
rent  variable  being  processed.  A  variable  may  be  pro¬ 
cessed  only  if  all  its  ancestors  have  already  been  pro¬ 
cessed.  The  active  nodes  at  each  step  form  a  path- 
decomposition  of  the  network’s  graph. 

To better explain the method, we take the example of
Figure 1. For simplicity, assume that variables are
binary² (with categories $b$ and $\neg b$) renamed as
follows: do-ground-attack is $D_1$, bomb-bridge is $D_2$,
cost-of-attack is $U_1$, cost-of-bombing is $U_2$,
ground-attack is $C_1$, bridge-condition is $C_2$,
territory-occupation is $C_3$, and finally profit-of-goal
is $U_3$.

After the translation of the utility functions into
probability distributions and the replacement of decision
nodes by nodes with imprecise probabilities (as previously
described), we have a credal network and need to maximize
the sum of the marginal probabilities of the $U$ nodes. In
fact this is an extension of the standard query in a credal
network, because we have a summation instead of a single
probability to maximize. So the objective function is
$\max\; p(u_1) + p(u_2) + p(u_3)$ (there are three utility
nodes in the example) subject to constraints that define
each marginal probability $p(u_1)$, $p(u_2)$ and $p(u_3)$.
To create these constraints, we run a symbolic inference
based on the precedence ordering for each of the marginal
probabilities. The constraints for $p(u_1)$ and $p(u_2)$
are very simple:
$p(u_1) = p(u_1 \mid d_1) p(d_1) + p(u_1 \mid \neg d_1) p(\neg d_1)$
and
$p(u_2) = p(u_2 \mid d_2) p(d_2) + p(u_2 \mid \neg d_2) p(\neg d_2)$,
because they only depend on one other variable. Note that
$p(d_1)$, $p(\neg d_1)$, $p(d_2)$ and $p(\neg d_2)$ that
appear in these constraints are unknown and thus become
optimization variables in the bilinear problem.

² The method works on non-binary variables as well.
The assumption is made here for ease of exposition.

To write the constraints for $p(u_3)$, we need to choose
a precedence ordering. We will use the ordering
$D_2, C_2, D_1, C_1, C_3, U_3$ (variables $U_1$ and $U_2$
do not appear in the ordering as they are not relevant to
evaluate the marginal $p(u_3)$). Hence, the first variable
to be processed is $D_2$. We write a constraint that
relates the query $u_3$ and probabilities $p(D_2)$ (which
are defined in the network specification):

$$p(u_3) = \sum_{d \in \{d_2, \neg d_2\}} p(d) \cdot p(u_3 \mid d).$$

$D_2$ now appears in the conditional part of
$p(u_3 \mid d)$, which may be viewed as an artificial term
in the optimization, as it does not appear in the network.
Because of that, we must create constraints to define
$p(u_3 \mid d)$ in terms of network parameters (for all
categories $d \in \Omega_{D_2}$). According to our chosen
ordering, the current variable to be processed is $C_2$.
Thus,

$$p(u_3 \mid d_2) = \sum_{c \in \{c_2, \neg c_2\}} p(c \mid d_2) \cdot p(u_3 \mid c),$$

$$p(u_3 \mid \neg d_2) = \sum_{c \in \{c_2, \neg c_2\}} p(c \mid \neg d_2) \cdot p(u_3 \mid c).$$

Note that $p(u_3 \mid c) = p(u_3 \mid c, d)$ (for any $d$),
so we use the simpler form. At this stage, our query is
conditioned on $C_2$. Following the same idea, we process
$D_1$, obtaining

$$p(u_3 \mid c_2) = \sum_{d \in \{d_1, \neg d_1\}} p(d) \cdot p(u_3 \mid c_2, d),$$

$$p(u_3 \mid \neg c_2) = \sum_{d \in \{d_1, \neg d_1\}} p(d) \cdot p(u_3 \mid \neg c_2, d).$$

Now the current variable to be treated is $C_1$, and our
query is conditioned on $C_2, D_1$, that is, we must define
how to evaluate $p(u_3 \mid C_2, D_1)$ for all
configurations. Thus, for all $c \in \{c_2, \neg c_2\}$ and
$d \in \{d_1, \neg d_1\}$:

$$p(u_3 \mid c, d) = \sum_{c' \in \{c_1, \neg c_1\}} p(c' \mid c, d) \cdot p(u_3 \mid c, c').$$

At this moment, $u_3$ is conditioned on $C_1, C_2$ in the
artificial term $p(u_3 \mid c, c')$ ($D_1$ is not present
in the artificial term as $C_1, C_2$ separate $U_3$ from
$D_1$). Now we process $C_3$: for all
$c' \in \{c_1, \neg c_1\}$ and $c \in \{c_2, \neg c_2\}$,

$$p(u_3 \mid c, c') = \sum_{c'' \in \{c_3, \neg c_3\}} p(c'' \mid c, c') \cdot p(u_3 \mid c'').$$

Note that, as $p(u_3 \mid c'')$ is specified in the
network, we can stop. All artificial terms are related
(through constraints) to parameters of the network. Besides
all these constraints, we also include simplex constraints
to ensure that probabilities sum to 1.
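The chain of constraints for $p(u_3)$ can be checked numerically once values are fixed for the decision probabilities. The sketch below evaluates the constraints by back-substitution (last constraint first) and compares the result with a brute-force sum over the joint distribution. All conditional probability values are hypothetical, and the parent sets follow the conditioning used in the constraints above ($C_2$ depends on $D_2$, $C_1$ on $C_2$ and $D_1$, $C_3$ on $C_2$ and $C_1$, $U_3$ on $C_3$).

```python
from itertools import product

BOOL = (True, False)  # a category and its negation, e.g. c2 and not-c2

def bern(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true

# Hypothetical parameters, stored as P(child = True | parents).
P_C2 = {True: 0.2, False: 0.9}                     # p(c2 | D2)
P_C1 = {(True, True): 0.8, (True, False): 0.3,
        (False, True): 0.6, (False, False): 0.1}   # p(c1 | C2, D1)
P_C3 = {(True, True): 0.9, (True, False): 0.4,
        (False, True): 0.5, (False, False): 0.05}  # p(c3 | C2, C1)
P_U3 = {True: 1.0, False: 0.25}                    # p(u3 | C3)

def p_u3_by_constraints(p_d1, p_d2):
    """Evaluate the constraints for p(u3), starting from the last one."""
    u_c2c1 = {(c, c1): sum(bern(P_C3[(c, c1)], c3) * P_U3[c3] for c3 in BOOL)
              for c, c1 in product(BOOL, BOOL)}    # p(u3 | c, c')
    u_c2d1 = {(c, d): sum(bern(P_C1[(c, d)], c1) * u_c2c1[(c, c1)] for c1 in BOOL)
              for c, d in product(BOOL, BOOL)}     # p(u3 | c, d)
    u_c2 = {c: sum(bern(p_d1, d) * u_c2d1[(c, d)] for d in BOOL)
            for c in BOOL}                         # p(u3 | c) for C2
    u_d2 = {d: sum(bern(P_C2[d], c) * u_c2[c] for c in BOOL)
            for d in BOOL}                         # p(u3 | d) for D2
    return sum(bern(p_d2, d) * u_d2[d] for d in BOOL)

def p_u3_by_enumeration(p_d1, p_d2):
    """Reference value: sum the factorized joint over all configurations."""
    total = 0.0
    for d2, c2, d1, c1, c3 in product(BOOL, repeat=5):
        total += (bern(p_d2, d2) * bern(P_C2[d2], c2) * bern(p_d1, d1)
                  * bern(P_C1[(c2, d1)], c1) * bern(P_C3[(c2, c1)], c3)
                  * P_U3[c3])
    return total
```

In the optimization problem, `p_d1` and `p_d2` are unknowns rather than inputs, which is what makes the products between them and the artificial terms bilinear.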

Hence, we have a collection of linear and bilinear constraints
on which non-linear programming can be employed [7]. It is also
possible to use linear integer programming [9]. The steps to
achieve a linear integer programming formulation are simple,
because the only non-linear terms of the problem have the format
$b \cdot t$, where $b \in \{0, 1\}$ and $t \in [0, 1]$. Here $b$ is an
unknown probability value of the credal network (which is zero or
one because the solution we look for lies on extreme points of
credal sets [12]) and $t$ is a constant or an artificial term
created in the procedure just described. To linearize the
problem, $b \cdot t$ is replaced by an additional artificial
optimization variable $y$ and the following constraints are
inserted: $0 \le y \le b$ and $t - 1 + b \le y \le t$. After replacing
all non-linear terms using this idea, the problem becomes a linear
integer programming problem, where a solution is also a solution
for the strategy selection in the initial influence diagram.
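
As a quick sanity check of the linearization above, the following sketch (the function name `linearized_bounds` is ours, introduced only for illustration) computes the interval of values of $y$ allowed by the two pairs of inequalities. For a binary $b$ and any $t \in [0, 1]$ the interval should collapse to the single point $b \cdot t$.

```python
def linearized_bounds(b, t):
    """Interval of y allowed by 0 <= y <= b and t - 1 + b <= y <= t."""
    lower = max(0.0, t - 1.0 + b)
    upper = min(float(b), t)
    return lower, upper


# For binary b the feasible interval collapses to the point b * t.
for b in (0, 1):
    for t in (0.0, 0.25, 0.5, 1.0):
        lower, upper = linearized_bounds(b, t)
        assert lower == upper == b * t
```

The values of `t` are chosen as exact binary fractions so that the equality test is not affected by floating-point rounding.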

We  emphasize  that,  as  we  are  translating  the  strat¬ 
egy  selection  problem  into  a  credal  network  inference, 
it  is  straightforward  to  use  imprecise  probabilities  in 
the  chance  nodes  of  the  influence  diagram.  Intervals 
or  sets  of  probabilities  may  be  used.  The  translation 
works  in  the  same  way,  but  the  generated  problem  will 
have  more  imprecise  probabilities  to  optimize. 

The  following  theorem  shows  that,  when  reformulat¬ 
ing  the  strategy  selection  problem  as  a  modified  credal 
network  inference,  we  are  not  making  use  of  “more  ef¬ 
fort”  than  necessary,  that  is,  strategy  selection  has  the 
same  complexity  as  inference  in  credal  networks. 

Theorem 1 Let X be a LIMID and k a rational. Deciding
whether there is a strategy A such that MEU is greater
than k is NP-complete when X has bounded induced
width,³ and NP^PP-complete in general.

Proof sketch: Membership for the bounded induced width case
holds because (given a strategy) we can compute MEU and verify
whether it is greater than k in polynomial time (using the
reformulation and the sum of marginal queries; each marginal
query takes polynomial time in a Bayesian network of bounded
induced width); in the general case, we can perform this
verification using a PP oracle. Hardness for the bounded induced
width case is obtained with the same reduction as in [8] from the
MAXSAT problem (replacing the credal nodes with decision nodes
and introducing a single utility node). In the general case, the
same reduction as in [17] from E-MAJSAT can be used (MAP nodes
are replaced by decision nodes). □

³ The maximum clique and the maximum degree in the moral graph
are bounded by a logarithmic function in the size of the input
needed to specify the problem, which for instance includes
polytrees.

5  EXPERIMENTS 

We conduct two experiments with the procedure. First, we use
randomly generated influence diagrams to compare the solutions
obtained by our procedure (which we call CR for credal
reformulation) against the Single Policy Updating (SPU) of
Lauritzen and Nilsson [15]. Later we work with a practical EBO
military planning problem and compare the method against the
factorization of Zhang and Ji [22].⁴

Concerning  random  influence  diagrams,  we  have  gen¬ 
erated  a  data  set  based  on  the  total  number  of  nodes 
and  the  number  of  decision  nodes.  The  configurations 
chosen  are  presented  in  the  first  two  columns  of  Table 
1.  We  have  from  10  to  120  nodes,  where  3  to  35  are 
decision nodes. The number of utility nodes is chosen
equal to the number of decision nodes. Each line
in Table 1 contains the average result for 30 randomly
generated diagrams within that configuration. The
third  column  of  the  table  shows  the  approximate  aver¬ 
age  number  of  distinct  strategies  in  the  diagrams  that 
would  need  to  be  evaluated  by  a  brute  force  method. 

The three columns of the CR method show the time spent to solve
the problem, the number of nodes evaluated in the
branch-and-bound tree of the optimization procedure (which is
significantly smaller than the total number of strategies in
brute force) and the maximum error of the solution (all numbers
are averages). After the reformulation, the CPLEX solver [16] is
used, which includes a heuristic search before starting the
branch-and-bound procedure. The evaluations of this heuristic
search are not counted in the fifth column of Table 1. Note that
the first five rows are separated from the last three because
they strongly differ in the size of the search space (exact
solutions were found only for the former). The maximum error of
each solution is obtained directly from the relaxation of the
linear integer problem. The last two columns of Table 1 show the
time and maximum error of the SPU approximate procedure. Although
very fast, the SPU procedure has worse accuracy than the
“approximate” CR (the solution was approximate in the last three
rows because we imposed a time limit of ten minutes for each
run). Furthermore, SPU does not provide an upper bound for the
best possible expected utility, as obtained by CR. Still, a
possible improvement is to use SPU to provide an initial guess to
the optimization.

⁴ The factorization idea only works on simultaneous influence
diagrams, so it was not used in the other test cases.

5.1  EBO  MILITARY  PLANNING 

In this section we describe the performance of our method in a
hypothetical Effects-based Operations planning problem [11]. An
influence diagram similar to the model described by Zhang and Ji
[22] is employed. Its graph is shown in Figure 2. The goal is to
win a war, which is represented by the Hypothesis node (on top of
Figure 2). Just below it are the subgoals Air-superiority,
Territory-occupation, and Commander-surrender, which are directly
related to the main goal. There are eleven decision nodes
(represented by rectangles): destroy-C2 (C2 stands for Command
and Control), destroy-Radars, destroy-Communications,
launch-airstrike, destroy-RD, destroy-storage, destroy-assembly,
launch-ground-attack, launch-broadcasting, capture-bodyguard, and
use-special-force. Just above the decision nodes, we have chance
nodes representing the outcomes of performing such actions (they
indicate the workability of such systems), and below we have
utility nodes (diamond-shaped nodes) describing the cost of each
action. Furthermore, we have six chance nodes (in the center of
the figure) indicating general workability of IADS (Integrated
Air Defense System), Air-force, Artillery, Ground-force, Morale
and Commander-in-custody with respect to enemy forces. The
overall profit of winning is given by the node $U_h$, child of
Hypothesis.

As this is a hypothetical example, we define utility
functions and probability distributions as follows:

• Probability of Hypothesis is one given that all subgoals are
achieved. If one of the subgoals is not achieved, then the
probability of Hypothesis is 60%; if two of them are not
achieved, then the probability of success is 30%; if none of the
subgoals is achieved, then we certainly fail in the campaign.

• For the subgoals Air-superiority, Territory-occupation, and
Commander-surrender, we define that the subgoal is accomplished
with probability one when both children are achieved, 50% when
only one child is achieved, and zero when none is achieved.

• For the probabilities of IADS, Air-force, Artillery,
Ground-force, Morale and Commander-in-custody, we define a
decrease of 50% for each unaccomplished child (with a minimum of
zero, of course). Any node has probability zero if two or more of
its children are not achieved.

• The outcomes of actions (chance nodes above decision nodes)
succeed with probability 90%. For example, destroy-Radars will
have EW/GCI radars destroyed with probability 90% (EW/GCI means
Early Warning / Ground Control Interception).

• The reward of achieving the main goal is 1000, while not
achieving it costs 500.

• Costs of actions are as follows: ground-attack is 150,
use-special-force is 100, capture-bodyguard is 80, airstrike is
50, and other actions cost 20 each.

Nodes            Approx. # of   CR                                        SPU
Total  Decision  strategies     Time (sec)  Evals (B&B)  Max. error (%)   Time (sec)  Max. error (%)
-----------------------------------------------------------------------------------------------------
10     3         2^17           0.66        5            0.000            0.10        0.740
20     6         2^34           1.73        125          0.000            0.39        2.788
50     10        2^51           30.42       4048         0.000            1.62        2.837
60     15        2^52           29.77       2937         0.000            2.99        1.964
70     20        2^54           125.06      7132         0.000            5.52        3.448
-----------------------------------------------------------------------------------------------------
120    25        2^102          254.80      15626        0.544            11.58       2.193
120    30        2^116          403.13      5617         4.639            13.79       7.281
120    35        2^120          578.99      9307         5.983            16.87       11.584

Table 1: Average results on 30 random influence diagrams of different sizes for the CR and SPU methods.

For this problem, the best strategy found by SPU has expected
utility of −55.2825, and suggests taking all actions except
destroy-RD, destroy-storage, destroy-assembly and
launch-ground-attack. The global optimum strategy is found in
less than 5 seconds with our method and has expected utility
equal to 156.4051 (all actions are taken). This is much faster
than the solution reported by [22] (around 45 seconds).

6  CONCLUSION 

We  discuss  in  this  paper  a  new  idea  for  strategy  selec¬ 
tion  in  Influence  Diagrams.  We  work  with  the  Limited 
Memory  Influence  Diagram,  as  it  generalizes  many  of 
the  influence  diagram  proposals.  The  main  contribu¬ 
tion  is  the  reformulation  of  the  problem  as  a  credal 
network inference, which makes it possible to find the
global  maximum  strategy  for  small-  and  medium-sized 
influence  diagrams.  Experiments  indicate  that  many 
instances  can  be  treated  exactly.  As  far  as  we  know, 
no  deep  investigation  of  exact  procedures  for  this  class 
of  diagrams  has  been  conducted. 

Because  of  the  characteristics  of  our  procedure,  an 
anytime  approximate  solution  with  a  maximum  guar¬ 
anteed  error  is  available  during  computations.  It  is 
clear  that  large  diagrams  must  be  treated  approxi¬ 
mately.  Nevertheless,  in  the  conducted  experiments, 
our method produced results that surpass existing algorithms.
Although it spends more time, many situations require a solution
to be as good as possible, while time is a secondary issue. The
ability of our approach to provide an upper bound for the result,
which is not available with the SPU method, is also valuable.

We  also  discuss  the  theoretical  complexity  of  the  prob¬ 
lem,  which  is  derived  from  the  known  properties  of 
MAP  problems  in  Bayesian  networks  and  belief  up¬ 
dating  inferences  in  credal  networks.  The  complex¬ 
ity  results  show  that  the  proposed  idea  is  not  making 
use  of  a  harder  problem  to  solve  a  simpler  one,  as 
the  complexity  of  strategy  selection  is  the  same  as  the 
complexity  of  inferences  in  credal  networks. 

Because  strategy  selection  in  influence  diagrams  and 
inferences  in  credal  networks  are  related,  improve¬ 
ments  on  algorithms  of  credal  networks  can  be  directly 
applied  to  influence  diagram  problems.  The  applica¬ 
tion  of  other  approximate  techniques  based  on  credal 
networks  seems  a  natural  path  for  investigation.  We 
also  intend  to  explore  other  optimization  criteria  for 
influence  diagrams  with  imprecise  probabilities,  be¬ 
sides  expected  utility.  Proposals  in  the  theory  of  im¬ 
precise  probabilities  might  be  applied  to  this  setting. 
Abstract

This  paper  addresses  exact  learning  of 
Bayesian  network  structure  from  data  and 
expert’s  knowledge  based  on  score  functions 
that  are  decomposable.  First,  it  describes 
useful  properties  that  strongly  reduce  the 
time  and  memory  costs  of  many  known  meth¬ 
ods  such  as  hill-climbing,  dynamic  program¬ 
ming  and  sampling  variable  orderings.  Sec¬ 
ondly,  a  branch  and  bound  algorithm  is  pre¬ 
sented  that  integrates  parameter  and  struc¬ 
tural  constraints  with  data  in  a  way  to  guar¬ 
antee  global  optimality  with  respect  to  the 
score function. It is an any-time procedure
because, if stopped, it provides the best current
solution and an estimate of how
far it is from the global solution. We show
empirically  the  advantages  of  the  properties 
and  the  constraints,  and  the  applicability  of 
the  algorithm  to  large  data  sets  (up  to  one 
hundred  variables)  that  cannot  be  handled 
by  other  current  methods  (limited  to  around 
30  variables). 

1.  Introduction 

A Bayesian network (BN) is a probabilistic graphical model that
relies on a structured dependency among random variables to
represent a joint probability distribution in a compact and
efficient manner. It is composed of a directed acyclic graph
(DAG) where nodes are associated with random variables and
conditional probability distributions are defined for variables
given their parents in the graph. Learning the graph (or
structure) of a BN from data is one of the most challenging
problems in such models. The best known exact methods take
exponential time in the number of variables and are applicable
only to small settings (around 30 variables). Approximate
procedures can handle larger networks, but they usually get stuck
in local maxima. Nevertheless, the quality of the structure plays
a crucial role in the accuracy of the model. If the dependency
among variables is not properly learned, the estimated
distribution may be far from the correct one. In general terms,
the problem is to find the best structure (DAG) according to some
score function that depends on the data (Heckerman et al., 1995).
There are other approaches to learn a structure that are not
based on scoring (for example taking some statistical similarity
among variables), but we do not discuss them in this paper. The
research on this topic is active, e.g. (Chickering, 2002;
Teyssier & Koller, 2005; Tsamardinos et al., 2006). The best
exact ideas (where it is guaranteed to find the global best
scoring structure) are based on dynamic programming (Koivisto et
al., 2004; Singh & Moore, 2005; Koivisto, 2006; Silander &
Myllymaki, 2006), and they spend time and memory proportional to
$n \cdot 2^n$, where $n$ is the number of variables. Such complexity
restricts the use of those methods to a couple of tens of
variables, mostly because of memory consumption.

In the first part of this paper, we present some properties of
the problem that bring a considerable improvement on many known
methods. We perform the analysis over some well known criteria:
the Akaike Information Criterion (AIC) and the Minimum
Description Length (MDL), which is equivalent to the Bayesian
Information Criterion (BIC). However, the results extend to the
Bayesian Dirichlet (BD) scoring (Cooper & Herskovits, 1992) and
some of its derivations under a few assumptions. We show that the
search space of possible structures can be reduced drastically
without losing the global optimality guarantee and that the
memory requirements are very small in many practical cases (we
show empirically that only a few thousand scores are stored for a
problem with 50 variables and one thousand instances).

As data sets with many variables cannot be efficiently handled
(unless P=NP, as the problem is known to be NP-hard (Chickering
et al., 2003)), a desired property of a method is to produce an
any-time solution, that is, the procedure, if stopped at any
moment, provides an approximate solution, while if run until it
finishes, a global optimum solution is found. However, the most
efficient exact methods are not any-time. We propose a new
any-time exact algorithm using a branch-and-bound (B&B) approach
with caches. Scores are computed during the initialization and a
pool is built. Then we perform the search over the possible
graphs iterating over arcs. Although iterating over orderings is
probably faster, iterating over arcs allows us to work with
constraints in a straightforward way. Because of the B&B
properties, the algorithm can be stopped at any time with the
best solution found so far and an upper bound to the global
optimum, which gives a kind of certificate to the answer and
allows the user to stop the computation when she believes that
the current solution is good enough. Suzuki (1996) proposed a B&B
method, but it is not a global exact algorithm; instead, the
search is conducted after a node ordering is fixed. Our method
does not rely on a predefined ordering and finds a global optimum
structure considering all possible orderings.

2.  Bayesian  networks 

A BN represents a single joint probability density over a
collection of random variables. It can be defined as a triple
$(\mathcal{G}, \mathcal{X}, \mathcal{P})$, where
$\mathcal{G} = (V_{\mathcal{G}}, E_{\mathcal{G}})$ is a DAG with
$V_{\mathcal{G}}$ a collection of $n$ nodes associated to random
variables $\mathcal{X}$ (a node per variable), and
$E_{\mathcal{G}}$ a collection of arcs; $\mathcal{P}$ is a
collection of conditional probability densities $p(X_i \mid PA_i)$
where $PA_i$ denotes the parents of $X_i$ in the graph ($PA_i$ may
be empty), respecting the relations of $E_{\mathcal{G}}$. We
assume throughout that variables are categorical. In a BN every
variable is conditionally independent of its non-descendants
given its parents (Markov condition). This structure induces a
joint probability distribution by the expression
$p(X_1, \ldots, X_n) = \prod_i p(X_i \mid PA_i)$. Before proceeding,
we define some notation. Let $r_i \ge 2$ be the number of discrete
categories of $X_i$, $q_i$ the number of elements in
$\Omega_{PA_i}$ (the number of configurations of the parent set,
that is, $q_i = \prod_{X_t \in PA_i} r_t$) and $\theta$ be the
entire vector of parameters such that
$\theta_{ijk} = p(x_i^k \mid pa_i^j)$, where $i \in \{1, \ldots, n\}$,
$j \in \{1, \ldots, q_i\}$, $k \in \{1, \ldots, r_i\}$ (hence
$x_i^k \in \Omega_{X_i}$ and $pa_i^j \in \Omega_{PA_i}$).


Given a complete data set $D = \{D_1, \ldots, D_N\}$ with $N$
instances, where $D_t = \{x_1^{k_1}, \ldots, x_n^{k_n}\}$ is an
instance of all variables, the goal of structure learning is to
find a $\mathcal{G}$ that maximizes a score function such as MDL
or AIC:

\[ \max_{\mathcal{G}} s_D(\mathcal{G}) = \max_{\mathcal{G}} \max_{\theta} \left( L_D(\theta) - t \cdot W \right), \]

where $\theta$ represents all parameters of the model (and thus
depends on the graph $\mathcal{G}$),
$t = \sum_{i=1}^{n} \left( q_i \cdot (r_i - 1) \right)$ is the number
of free parameters, $W$ is criterion-specific
($W = \frac{\log N}{2}$ in MDL and $W = 1$ in AIC), and $L_D$ is
the log-likelihood function:

\[ L_D(\theta) = \log \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{n_{ijk}}, \tag{1} \]

where $n_{ijk}$ indicates how many elements of $D$ contain both
$x_i^k$ and $pa_i^j$. This function can be written as
$L_D(\theta) = \sum_{i=1}^{n} L_{D,i}(\theta_i)$, where
$L_{D,i}(\theta_i) = \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} n_{ijk} \log \theta_{ijk}$.
From now on, the subscript $D$ is omitted for simplicity.

An important property of such criteria is that they are
decomposable, that is, they can be applied to each node $X_i$
separately:
$\max_{\mathcal{G}} s(\mathcal{G}) = \max_{\mathcal{G}} \sum_{i=1}^{n} s_i(PA_i)$,
where $s_i(PA_i) = L_i(PA_i) - t_i(PA_i) \cdot W$, with
$L_i(PA_i) = \max_{\theta_i} L_i(\theta_i)$ ($\theta_i$ is the
parameter vector related to $X_i$, so it depends on the choice of
$PA_i$), and $t_i(PA_i) = q_i \cdot (r_i - 1)$. Because of this
property and to avoid computing such functions several times, we
create a cache that contains $s_i(PA_i)$ for each $X_i$ and each
parent set $PA_i$. Note that this cache may have size exponential
in $n$, as there are $2^{n-1}$ subsets of
$\{X_1, \ldots, X_n\} \setminus \{X_i\}$ to be considered as parent
sets. This gives a total space and time of $O(n \cdot 2^n)$ to build
the cache. However, the following results show that this number
is much smaller in many practical cases.
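
To make the decomposition concrete, the sketch below computes a single local score $s_i(PA_i)$ from counts and stores two entries of a cache. It is only an illustration under our own assumptions: the function name `local_score`, the data layout (one tuple of category indices per instance) and the use of base-2 logarithms are not taken from the paper.

```python
import math
from collections import Counter


def local_score(data, child, parents, arity, W):
    """s_i(PA_i) = L_i(PA_i) - t_i(PA_i) * W for one node.

    data: list of tuples of category indices; arity[v] is r_v.
    """
    # n_ijk: joint counts of (parent configuration j, child state k).
    n_ijk = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    # n_ij: counts of each observed parent configuration.
    n_ij = Counter()
    for (j, _k), count in n_ijk.items():
        n_ij[j] += count
    # The maximum log-likelihood is attained at theta_ijk = n_ijk / n_ij.
    log_lik = sum(c * math.log2(c / n_ij[j]) for (j, _k), c in n_ijk.items())
    q_i = math.prod(arity[p] for p in parents)  # equals 1 for no parents
    t_i = q_i * (arity[child] - 1)
    return log_lik - t_i * W


data = [(0, 0), (0, 0), (1, 1), (1, 1)]  # toy data in which X1 copies X0
arity = [2, 2]
W_mdl = math.log2(len(data)) / 2  # MDL weight; AIC would use W = 1
cache = {
    (): local_score(data, 1, (), arity, W_mdl),
    (0,): local_score(data, 1, (0,), arity, W_mdl),
}
```

On this toy data the parent set $\{X_0\}$ scores higher than the empty set, because the gain in likelihood outweighs the extra free parameter.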

Lemma 1 Let $X_i$ be a node of $\mathcal{G}'$, a DAG for a BN
where $PA_i = J'$. Suppose $J \subset J'$ is such that
$s_i(J) > s_i(J')$. Then $J'$ is not the parent set of $X_i$ in
the optimal DAG.

Proof. Take a graph $\mathcal{G}$ that differs from $\mathcal{G}'$
only on $PA_i = J$, which is also a DAG (as the removal of some
arcs does not create cycles) and
$s(\mathcal{G}) = \sum_{j \ne i} s_j(PA_j) + s_i(J) > \sum_{j \ne i} s_j(PA_j) + s_i(J') = s(\mathcal{G}')$.
Hence any DAG $\mathcal{G}'$ such that $PA_i = J'$ has a subgraph
$\mathcal{G}$ with a better score than $\mathcal{G}'$, and thus
$J'$ is not the optimal parent configuration for $X_i$. □

Lemma 1 is quite simple but very useful to discard
elements from the cache of $X_i$. However, it does not
tell anything about supersets of $J'$, that is, we still
need to compute all the possible parent configurations
and later verify which of them can be removed. The next
theorem handles this issue.
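
The filtering step that Lemma 1 permits can be sketched as follows. This is a minimal illustration, not the paper's implementation: we assume the cache of one node is a dictionary from parent sets (as `frozenset`s) to scores, and the name `prune_dominated` is hypothetical.

```python
def prune_dominated(cache):
    """Keep only parent sets with no proper subset scoring strictly higher."""
    kept = {}
    for parents, score in cache.items():
        # For frozensets, `other < parents` tests for a proper subset.
        dominated = any(
            other < parents and other_score > score
            for other, other_score in cache.items()
        )
        if not dominated:
            kept[parents] = score
    return kept


cache = {
    frozenset(): -5.0,
    frozenset({"A"}): -2.0,
    frozenset({"B"}): -6.0,
    frozenset({"A", "B"}): -3.5,
}
pruned = prune_dominated(cache)
```

In this example the sets {B} and {A, B} are discarded, since the empty set and {A} respectively score higher. The empty set is kept even though {A} scores higher, because Lemma 1 only compares a parent set against its subsets.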


Theorem 1 Using MDL or AIC as score function and assuming
$N \ge 4$, take $\mathcal{G}$ and $\mathcal{G}'$ DAGs such that
$\mathcal{G}$ is a subgraph of $\mathcal{G}'$. If $\mathcal{G}$ is
such that $\prod_{j \in PA_i} r_j \ge N$, for some $X_i$, and $X_i$
has a proper superset of parents in $\mathcal{G}'$ w.r.t.
$\mathcal{G}$, then $\mathcal{G}'$ is not an optimal structure.

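Proof.¹ Take a DAG $\mathcal{G}$ such that $J = PA_i$ for a node
$X_i$, and take $\mathcal{G}'$ equal to $\mathcal{G}$ except that
it contains an extra node in $J^{new} = PA_i$, that is, in
$\mathcal{G}'$ we have $J^{new} = J \cup \{X_e\}$. Note that the
difference in the scores of the two graphs is restricted to
$s_i(\cdot)$. In the graph $\mathcal{G}'$, $L_i(J^{new})$ will
certainly not decrease and $t_i(J^{new})$ will increase, both with
respect to the values for $\mathcal{G}$. The difference in the
scores will be $s_i(J^{new}) - s_i(J)$, which equals

\begin{align*}
& L_i(J^{new}) - t_i(J^{new}) \cdot W - \left( L_i(J) - t_i(J) \cdot W \right) \\
& \quad \le -\sum_{j=1}^{q_i} \sum_{k=1}^{r_i} n_{ijk} \log \theta^*_{ijk} - t_i(J^{new}) \cdot W + t_i(J) \cdot W \\
& \quad = -\sum_{j=1}^{q_i} n_{ij} \sum_{k=1}^{r_i} \frac{n_{ijk}}{n_{ij}} \log \frac{n_{ijk}}{n_{ij}} - t_i(J^{new}) \cdot W + t_i(J) \cdot W \\
& \quad = \sum_{j=1}^{q_i} n_{ij} H(\theta^*_{ij}) - t_i(J^{new}) \cdot W + t_i(J) \cdot W \\
& \quad \le \sum_{j=1}^{q_i} n_{ij} \log r_i - q_i \cdot (r_e - 1) \cdot (r_i - 1) \cdot W,
\end{align*}

where $q_i$, $n_{ij}$ and $\theta^*_{ij}$ refer to the parent set
$J$, and $H(\cdot)$ denotes entropy. The first step uses the fact
that $L_i(J^{new})$ is negative, the second step uses the fact
that $\theta^*_{ijk} = \frac{n_{ijk}}{n_{ij}}$, with
$n_{ij} = \sum_{k=1}^{r_i} n_{ijk}$, is the value that maximizes
$L_i(\cdot)$, the third step rewrites the inner sum as an entropy,
and the last step uses the fact that the entropy of a discrete
distribution is less than the log of its number of categories,
together with
$t_i(J^{new}) - t_i(J) = q_i \cdot (r_e - 1) \cdot (r_i - 1)$.
Finally, $\mathcal{G}$ is a better graph than $\mathcal{G}'$ if
the last expression is negative, which happens if
$q_i \cdot (r_e - 1) \cdot (r_i - 1) \cdot W > N \log r_i$ (recall that
$\sum_j n_{ij} = N$). Because $r_i \ge 2 \Rightarrow r_i - 1 \ge \log r_i$,
and $N \ge 4 \Rightarrow \frac{\log N}{2} \ge 1$ (the $W$ of the MDL
case), we have that $q_i = \prod_{j \in J} r_j \ge N$ ensures that
$s_i(J^{new}) < s_i(J)$, which implies that the graph
$\mathcal{G}'$ cannot be optimal. □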
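
The way Theorem 1 limits the parent sets worth scoring can be sketched as below. This is an illustrative enumeration under our own assumptions (variables indexed from 0, the hypothetical name `candidate_parent_sets`), not the procedure used in the experiments. A set is skipped when one of its proper subsets already has a product of arities of at least $N$; since arities are at least 2, it suffices to test the subsets obtained by removing a single parent.

```python
from itertools import combinations
from math import prod


def candidate_parent_sets(child, arity, N):
    """Parent sets of `child` that the Theorem 1 condition does not exclude."""
    others = [v for v in range(len(arity)) if v != child]
    result = []
    for size in range(len(others) + 1):
        for J in combinations(others, size):
            # J is excluded if dropping one parent leaves a product >= N.
            excluded = any(
                prod(arity[p] for p in J if p != x) >= N for x in J
            )
            if not excluded:
                result.append(J)
    return result


# Five binary variables and N = 4 instances: at most two parents survive.
sets_for_x0 = candidate_parent_sets(0, [2] * 5, 4)
```

With five binary variables and $N = 4$, only the empty set, the four singletons and the six pairs remain for $X_0$, in line with a bound of $\log_2 N = 2$ parents.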


Corollary 1 In the optimal structure $\mathcal{G}$, each node
has at most $O(\log N)$ parents.

Proof. It follows directly from Theorem 1 and the fact that
$r_i \ge 2$, for all $X_i$. □

Theorem 1 and Corollary 1 ensure that the cache stores at most
$O\left(\binom{n-1}{\log N}\right)$ elements for each variable
(all combinations up to $\log N$ parents). Although it does not
help us to improve the theoretical size bound, Lemma 2 gives us
even fewer elements.

Lemma 2 Let $X_i$ be a node with $J \subset J'$ two possible
parent sets such that $t_i(J') + s_i(J) > 0$. Then $J'$ and all
supersets $J'' \supset J'$ are not optimal parent configurations
for $X_i$.

Proof. Because $L_i(\cdot)$ is a negative function,
$t_i(J') + s_i(J) > 0 \Rightarrow -t_i(J') - s_i(J) < 0 \Rightarrow (L_i(J') - t_i(J')) - s_i(J) < 0 \Rightarrow s_i(J') < s_i(J)$.
Using Lemma 1, we have that $J'$ is not the optimal parent set
for $X_i$. The result also follows for any $J'' \supset J'$, as we
know that $t_i(J'') > t_i(J')$. □

Thus, the idea is to check the validity of Lemma 2 every time the
score of a parent set $J'$ of $X_i$ is about to be computed,
discarding $J'$ and all supersets whenever possible. This result
allows us to stop computing scores for $J'$ and all its
supersets. Lemma 1 is stronger, but regards a comparison between
exactly two parent configurations. Nevertheless, Lemma 1 can be
applied to the final cache to remove all certainly useless parent
configurations. As we see in Section 5, the practical size of the
cache after these properties is small even for large networks.
Lemma 1 is also valid for other decomposable functions, including
BD and derivations (e.g. BDe, BDeu), so the benefits shall apply
to those scores too, and the memory requirements will be reduced.
The other theorems need assumptions about the initial $N$ and the
choice of priors. Further discussion is left for future work
because of lack of space.

¹ Another similar proof appears in (Bouckaert, 1994), but it
leads directly to the conclusion of Corollary 1. The intermediate
result is algorithmically important.

3. Constraints

An additional way to reduce the space of possible DAGs is to
consider some constraints provided by experts. We work with two
main types of constraints: constraints on parameters that define
rules about the probability values inside the local distributions
of the network, and structural constraints that specify where
arcs may or may not be included.


3.1.  Parameter  Constraints 

We work with a general definition of parameter constraint, where
any convex constraint is allowed. If $\theta_{i,PA_i}$ is the
parameter vector of the node $X_i$ with parent set $PA_i$, then a
convex constraint is defined as $h(\theta_{i,PA_i}) \le 0$, where
$h : \Omega_{\theta_{i,PA_i}} \to \mathbb{R}$ is a convex function
over $\theta_{i,PA_i}$. This definition includes many well known
constraints, for example from Qualitative Probabilistic Networks
(QPN) (Wellman, 1990): qualitative influences define some
knowledge about the state of a variable given the state of
another, which roughly means that observing a greater state for a
parent $X_a$ of a variable $X_b$ makes it more likely to have
greater states in $X_b$ (for any parent configuration except for
$X_a$). For example, $\theta_{b j_2 2} \ge \theta_{b j_1 2}$, where
$j_k = \{x_a^k, pa_b^{j^*}\}$ and $j^*$ is an index ranging over
all parent configurations except for $X_a$. In this case,
observing $x_a^2$ makes it more likely to have $x_b^2$. A negative
influence is obtained by replacing the inequality operator $\ge$
by $\le$, and a zero influence is obtained by changing the
inequality to an equality. Other constraints such as synergies
(Wellman, 1990) are also linear and local to a single node.
Although we allow general parameter constraints, we impose the
following restriction on them: if a constraint is specified for a
node $X_i$ and a set of parents $J$, then the actual parent set
$PA_i$ has to be a superset of $J$. Furthermore, we have a
peculiar interpretation for each constraint $C$ as follows: if
$J \subset PA_i$ (proper subset), then the parameter constraint
must hold for all configurations of the parents of $X_i$ that do
not belong to $J$. For example, suppose $X_1$ has $X_2$ and $X_3$
as parents (all of them binary), and the following constraint $h$
was defined on
X\:  p(x\\x\x\)  +  2  ■  p(x\\x2x\)  <  1.  If  a  new  node  X4 
is  included  as  parent  of  Xi,  the  constraint  h  becomes 
the  two  following  constraints: 

p(x{\xlx\x\)  +  2  ■  p{x\\x\x\x\)  <  1, 

p(x\\xlx\x\)  +  2  ■  p{x\\xlx\x\)  <  1, 

that is, $h$ holds for each state of $X_4$. Similarly, if another
parent $X_5$ is included, then four constraints would be
enforced, one for each joint state of $X_4$ and $X_5$. This
interpretation for constraints is in line with the definition of
qualitative constraints of QPNs, and most importantly, it allows
us to treat the constraints in a principled way for each set of
parents. It means that the constraint must hold for all
configurations of parents not involved in the constraint, which
can also be interpreted as saying that the other parents are not
relevant and the constraint is valid for each one of their
configurations.
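
The replication of a constraint over the states of parents that it does not mention can be sketched as follows. The function `expand_constraint`, the dictionary representation of a parent configuration and the 0-based state indices are our own illustrative assumptions.

```python
from itertools import product


def expand_constraint(base_parents, extra_parents, arity):
    """List the parent configurations to which one constraint is copied.

    base_parents: dict mapping each parent named in the constraint to its
    fixed state; one copy is produced per joint state of extra_parents.
    """
    copies = []
    for states in product(*(range(arity[p]) for p in extra_parents)):
        config = dict(base_parents)
        config.update(zip(extra_parents, states))
        copies.append(config)
    return copies


arity = {"X2": 2, "X3": 2, "X4": 2, "X5": 2}
base = {"X2": 1, "X3": 1}  # states fixed by one term of the constraint
with_x4 = expand_constraint(base, ["X4"], arity)
with_x4_x5 = expand_constraint(base, ["X4", "X5"], arity)
```

Adding one binary parent yields two copies and adding two binary parents yields four, matching the counts given in the text.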

3.2.  Structural  constraints 

Besides probabilistic constraints, we work with structural
constraints on the possible graphs. These constraints help to
reduce the search space and are available in many situations. We
work with the following rules:

• $indegree(X_j, k, op)$, where $op \in \{lt, eq\}$ and $k$ is an
integer, means that the node $X_j$ must have less than (when
$op = lt$) or exactly (when $op = eq$) $k$ parents.

• $arc(X_i, X_j)$ indicates that the node $X_i$ must be a parent
of $X_j$.

• Operators or ($\lor$) and not ($\neg$) are used to form the
rules. The and operator is not explicitly used as we assume that
each constraint is in disjunctive normal form.

For example, the constraints $\forall_{i \neq c,\, j \neq c}\ \neg\,\mathrm{arc}(X_i, X_j)$
and $\mathrm{indegree}(X_c, 0, eq)$ impose that only arcs from
node $X_c$ to the others are possible, and that $X_c$ is
a root node; that is, a Naive Bayes structure will be
learned. The procedure will also act as a feature
selection procedure by leaving some variables unlinked.
Note that the symbol $\forall$ just employed is not part of
the language but is used for ease of exposition (in fact,
it is necessary to write down every constraint defined
by such a construction). As another example, the
constraints $\mathrm{indegree}(X_j, 3, lt)$, $\mathrm{indegree}(X_c, 0, eq)$,
and $\mathrm{indegree}(X_j, 0, eq) \vee \mathrm{arc}(X_c, X_j)$, stated for
every $j \neq c$, ensure that all nodes have $X_c$ as parent,
or no parent at all. Besides $X_c$, each node may have at
most one other parent, and $X_c$ is a root node. This
learns the structure of a Tree-augmented Naive (TAN)
classifier, also performing a kind of feature selection
(some variables may end up unlinked). In fact, it learns
a forest of trees, as we have not imposed that all
variables must be linked.
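
The TAN rules above are local to each node, so they can be checked on one candidate parent set at a time. The following is a minimal sketch of such a check, not the authors' implementation; the function names `tan_allowed` and `allowed_parent_sets` and the string node labels are assumptions introduced only for illustration.

```python
from itertools import combinations

def tan_allowed(node, parents, class_node):
    """Check the three TAN rules of this section for one node's parent set."""
    parents = set(parents)
    if node == class_node:
        return len(parents) == 0          # indegree(Xc, 0, eq)
    if len(parents) >= 3:                 # violates indegree(Xj, 3, lt)
        return False
    # indegree(Xj, 0, eq) OR arc(Xc, Xj)
    return len(parents) == 0 or class_node in parents

def allowed_parent_sets(node, variables, class_node):
    """Enumerate every parent set of `node` that respects the TAN rules."""
    others = [v for v in variables if v != node]
    return [set(c)
            for k in range(len(others) + 1)
            for c in combinations(others, k)
            if tan_allowed(node, c, class_node)]
```

With four variables and class node "C", a non-class node keeps only the empty set, {C}, and {C, X} for each other variable X, which is the kind of reduction that makes the constrained cache small.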

3.3.  Dealing  with  constraints 

All constraints in the previous examples can be imposed
during the construction of the cache, because each of
them involves just a single node. In essence, parent sets
of a node $X_i$ that violate some constraint are not
stored in the cache, and this can be checked during the
cache construction. On the other hand, constraints
such as $\mathrm{arc}(X_1, X_2) \vee \mathrm{arc}(X_2, X_3)$ cannot be imposed
at that stage: they impose a non-local condition (the
arcs go to distinct variables, namely $X_2$ and $X_3$),
whereas the cache construction is essentially a local
procedure with respect to each variable. Such constraints
that involve distinct nodes can be verified during the
B&B phase, so they are addressed later.

Regarding parameter constraints, we compute the
scores using a constrained optimization problem, i.e.,
we maximize the score function subject to simplex
equality constraints and all parameter constraints
defined by the user:

\[
\begin{aligned}
\max_{\theta_i}\quad & L_i(\theta_i) - t_i(PA_i) \\
\text{subject to}\quad & \forall_{j=1 \ldots q_i}\;\; g_{ij}(\theta_{ij}) = 0, \\
& \forall_{z=1 \ldots m_{h_i}}\;\; h_{iz}(\theta_i) \le 0,
\end{aligned}
\tag{2}
\]

where $g_{ij}(\theta_{ij}) = -1 + \sum_{k=1}^{r_i} \theta_{ijk}$ imposes that the
distributions defined for each variable given a parent
configuration sum to one over all variable states, and the
$m_{h_i}$ convex constraints $h_{iz}$ define the space of feasible
parameters for the node $X_i$. This is possible because: (1) we
have assumed that a constraint over $p(x_i^{k} \mid x_{i_1}^{k_1}, \ldots, x_{i_t}^{k_t})$
forces $\{X_{i_1}, \ldots, X_{i_t}\} \subseteq PA_i$, that is, when a
parameter constraint is imposed, the parent set of the node
must contain at least the variables involved in the
constraint; (2) the optimization is computed for every
possible parent set, that is, $PA_i$ is known at the moment
the optimization problem is written down, and the problem
is solved for each $X_i$ and each set $PA_i$. We use the
optimization package of (Birgin et al., 2000).

Theorem 2 Using MDL or AIC as score function
and assuming $N > 4$, take $\mathcal{G}$ and $\mathcal{G}'$ as DAGs such
that $\mathcal{G}$ is a subgraph of $\mathcal{G}'$. Suppose that both $\mathcal{G}$ and $\mathcal{G}'$
respect the same set of parameter and structural
constraints. If $\mathcal{G}$ is such that $\prod_{j \in PA_i} r_j > N$, for some
$X_i$, and $X_i$ has a proper superset of parents in $\mathcal{G}'$ w.r.t.
$\mathcal{G}$, then $\mathcal{G}'$ is not an optimal structure.

Proof. Just note that all derivations in Theorem 1
are also valid in the case of constraints. The only
difference that deserves a comment is $\theta_{ijk} = n_{ijk}/n_{ij}$, which
may be an infeasible point for the optimization (2),
because the latter contains parameter constraints that
might reduce the parameter space (besides the normal
constraints of the maximum log-likelihood problem).
As $\theta_{ijk} = n_{ijk}/n_{ij}$ is just used to obtain an upper value
for the log-likelihood function, and the constrained
version can only obtain objective values no larger than
those of the unconstrained version, $n_{ijk}/n_{ij}$ yields an
upper bound also for the constrained case. Thus, the
derivation of Theorem 1 is valid even with constraints. □

Corollary  1  and  Lemmas  1  and  2  are  also  valid  in  this 
setting.  The  proof  of  Corollary  1  is  straightforward,  as 
it  only  depends  on  Theorem  1,  while  for  Lemmas  1  and 
2  we  need  just  to  ensure  that  all  the  parent  configura¬ 
tions  that  are  discussed  there  respect  the  constraints. 

4.  Constrained  B&B  algorithm 

In this section we describe the B&B algorithm used
to find the best structure of the BN and comment on
its complexity, correctness, and some extensions and
particular cases. The notation (and initialization of
the algorithm) is as follows: $C : (X_i, PA_i) \to \mathbb{R}$ is the
cache with the scores for all the variables and their
possible parent configurations (using Theorem 1 and
Lemmas 1 and 2 to have a reduced size); $\mathcal{G}$ is the
graph created by taking the best parent configuration for
each node without checking for acyclicity (so it is not
necessarily a DAG), and $s$ is the score of $\mathcal{G}$; PL is an
initially empty matrix containing, for each possible arc
between nodes, a mark stating that the arc must be
present, or is prohibited, or is free (may be present or
not); $Q$ is a priority queue of triples $(\mathcal{G}, PL, s)$, ordered
by $s$ (initially it contains a single triple with the $\mathcal{G}$, PL
and $s$ just mentioned); and finally $(\mathcal{G}_{best}, s_{best})$ is the best
DAG and score found so far ($s_{best}$ is initialized with
$-\infty$). The main loop is as follows:

While $Q$ is not empty, do

1.  Remove the top element $(\mathcal{G}_{cur}, PL_{cur}, s_{cur})$ of $Q$. If
$s_{cur} < s_{best}$ (worse than an already known solution), then
start the loop again. If $\mathcal{G}_{cur}$ is a DAG and satisfies
all structural constraints, update $(\mathcal{G}_{best}, s_{best})$
with $(\mathcal{G}_{cur}, s_{cur})$ and start the loop again.

2.  Take $v = (X_{a_1} \to X_{a_2} \to \cdots \to X_{a_{q+1}})$, with
$a_1 = a_{q+1}$, a directed cycle of $\mathcal{G}_{cur}$.

3.  For $y = 1, \ldots, q$, do

•  Mark on $PL_{cur}$ that the arc $X_{a_y} \to X_{a_{y+1}}$ is
prohibited.

•  Recompute $(\mathcal{G}, s)$ from $(\mathcal{G}_{cur}, s_{cur})$ such that
the parents of $X_{a_{y+1}}$ in $\mathcal{G}$ comply with
this restriction and with $PL_{cur}$. Furthermore,
the subgraph of $\mathcal{G}$ formed by the arcs
that are demanded by $PL_{cur}$ (those marked
as must exist) must comply with the
structural constraints. It might be impossible
to get such a graph; in that case, go
to the last bullet. Use the values in the
cache $C(X_{a_{y+1}}, PA_{a_{y+1}})$ to avoid recomputing
scores.

•  Include the triple $(\mathcal{G}, PL_{cur}, s)$ into $Q$.

•  Mark on $PL_{cur}$ that the arc $X_{a_y} \to X_{a_{y+1}}$
must be present and that the sibling arc
$X_{a_{y+1}} \to X_{a_y}$ is prohibited, and continue.
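
The main loop can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' code: the cache is assumed to be a dictionary mapping each node to a list of `(score, frozenset_of_parents)` pairs sorted from best to worst; the arc marks are stored in a dictionary with the hypothetical labels `"req"` and `"proh"` (absent means free); for brevity every node is recomputed from the marks in each subcase (the text only recomputes the node at the head of the newly prohibited arc); and structural constraints other than acyclicity are omitted.

```python
import heapq
import itertools

def best_entry(cache, node, marks):
    """Best cached (score, parents) of `node` that agrees with the arc marks."""
    demanded = {u for (u, v), m in marks.items() if v == node and m == "req"}
    for score, parents in cache[node]:            # entries sorted best first
        if demanded <= parents and all(marks.get((p, node)) != "proh" for p in parents):
            return score, parents
    return None

def find_cycle(parents_of):
    """Return the arcs (parent, child) of one directed cycle, or None."""
    color = {v: 0 for v in parents_of}            # 0 new, 1 on stack, 2 done
    stack = []
    def visit(v):
        color[v] = 1
        stack.append(v)
        for p in parents_of[v]:
            if color[p] == 1:                     # reached a node on the stack
                cyc = stack[stack.index(p):]
                return [(cyc[(i + 1) % len(cyc)], cyc[i]) for i in range(len(cyc))]
            if color[p] == 0:
                found = visit(p)
                if found:
                    return found
        color[v] = 2
        stack.pop()
        return None
    for v in parents_of:
        if color[v] == 0:
            found = visit(v)
            if found:
                return found
    return None

def branch_and_bound(cache):
    nodes = list(cache)
    tie = itertools.count()                       # tie-breaker for the heap
    def relaxed(marks):
        graph = {v: best_entry(cache, v, marks) for v in nodes}
        if any(e is None for e in graph.values()):
            return None                           # no graph complies with the marks
        return -sum(s for s, _ in graph.values()), next(tie), graph, marks
    queue = [relaxed({})]                         # assumes every node has a cached entry
    best_graph, best_score = None, float("-inf")
    while queue:
        neg_s, _, graph, marks = heapq.heappop(queue)
        if -neg_s <= best_score:                  # cannot improve the incumbent
            continue
        parents_of = {v: ps for v, (_, ps) in graph.items()}
        cycle = find_cycle(parents_of)
        if cycle is None:                         # a DAG: new incumbent
            best_graph, best_score = parents_of, -neg_s
            continue
        marks = dict(marks)
        for u, v in cycle:
            if marks.get((u, v)) != "req":        # a demanded arc cannot be prohibited
                item = relaxed({**marks, (u, v): "proh"})
                if item is not None:
                    heapq.heappush(queue, item)
            marks[(u, v)] = "req"                 # later subcases demand this arc
            marks[(v, u)] = "proh"                # and forbid its sibling
    return best_graph, best_score
```

Because the queue is ordered by the relaxed score, the score at the top of the queue is an upper bound on the global optimum at any time, which is what makes early stopping informative.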

The  algorithm  uses  a  B&B  search  where  each  case  to 
be  solved  is  a  relaxation  of  a  DAG,  that  is,  they  may 
contain  cycles.  At  each  step,  a  graph  is  picked  up  from 
a  priority  queue,  and  it  is  verified  if  it  is  a  DAG.  In  such 
case,  it  is  a  feasible  structure  for  the  network  and  we 
compare  its  score  against  the  best  score  so  far  (which 
is  updated  if  needed).  Otherwise,  there  must  be  a 
directed  cycle  in  the  graph,  which  is  then  broken  into 
subcases by forcing some arcs to be absent/present.
Each  subcase  is  put  in  the  queue  to  be  processed.  The 
procedure  stops  when  the  queue  is  empty.  Note  that 
every  time  we  break  a  cycle,  the  subcases  that  are 
created  are  independent,  that  is,  the  sets  of  graphs 
that respect PL for each subcase are disjoint. We obtain
this fact by properly breaking the cycles: when $v =
(X_{a_1} \to X_{a_2} \to \cdots \to X_{a_{q+1}})$ is detected, we create $q$
subcases such that the first does not contain $X_{a_1} \to
X_{a_2}$ (but may contain the other arcs of that cycle),
the second case certainly contains $X_{a_1} \to X_{a_2}$, but
$X_{a_2} \to X_{a_3}$ is prohibited (so they are disjoint because
of the difference in the presence of the first arc), and so
on, such that the $y$-th case certainly contains $X_{a_{y'}} \to
X_{a_{y'+1}}$ for all $y' < y$ and prohibits $X_{a_y} \to X_{a_{y+1}}$.
This idea ensures that we never process the same graph
twice. So the algorithm runs at most $\prod_i |C(X_i)|$ steps,
where $|C(X_i)|$ is the size of the cache for $X_i$.
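
The disjointness argument can be made concrete with a small sketch that lists, for a detected cycle, the arcs each subcase demands and prohibits. The function name `cycle_subcases` and the representation of arcs as `(parent, child)` pairs are assumptions made for illustration only.

```python
def cycle_subcases(cycle_arcs):
    """For the y-th arc of the cycle, demand all earlier arcs and prohibit arc y.

    Returns a list of (required, prohibited) pairs of arc sets. The sibling
    (reversed) arc of every demanded arc is prohibited as well.
    """
    cases = []
    for y, arc in enumerate(cycle_arcs):
        required = set(cycle_arcs[:y])
        prohibited = {arc} | {(v, u) for (u, v) in required}
        cases.append((required, prohibited))
    return cases
```

Any two subcases differ on at least one arc that the earlier one prohibits and the later one demands, so no graph can belong to both.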

B&B  can  be  stopped  at  any  time  and  the  current  best 
solution  as  well  as  an  upper  bound  for  the  global  best 
score  are  available.  This  stopping  criterion  might  be 
based  on  the  number  of  steps,  time  and/or  memory 
consumption.  Moreover,  the  algorithm  can  be  easily 
parallelized.  We  can  split  the  content  of  the  priority 
queue  into  many  different  tasks.  No  shared  memory 
needs  to  exist  among  tasks  if  each  one  has  its  own  ver¬ 
sion  of  the  cache.  The  only  data  structure  that  needs 
consideration  is  the  queue,  which  from  time  to  time 
must be balanced between tasks. With a message-passing
scheme that avoids process locks, the gain
of parallelization is linear in the number of tasks. As
far as we know, the best known exact methods are not
easily parallelized, do not deal with constraints,
and do not provide lower and upper estimates of
the best structure if stopped early. If run until it ends,
the proposed method gives a globally optimal solution
for the structure learning problem.

Some  particular  cases  of  the  algorithm  are  worth  men¬ 
tioning.  If  we  fix  an  ordering  for  the  variables  such 
that  all  the  arcs  must  link  a  node  towards  another 
non-precedent  in  the  ordering  (this  is  a  common  idea  in 
many  approximate  methods),  the  proposed  algorithm 
does  not  perform  any  branch,  as  the  ordering  implies 
acyclicity,  and  so  the  initial  solution  is  already  the 
best.  The  performance  would  be  proportional  to  the 
time to create the cache. On the other hand, bounding
the maximum number of parents of a node is relevant
only for the hardest inputs, as it would imply a bound on
the cache size, which is already empirically small.
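
The fixed-ordering case can be sketched as follows: each node simply takes its best cached parent set among those whose members all precede it, and no branching is needed. This is an illustrative sketch; the cache layout (node mapped to a best-first list of `(score, frozenset_of_parents)` pairs) and the name `best_for_ordering` are assumptions, not part of the original text.

```python
def best_for_ordering(cache, ordering):
    """Best structure whose arcs all go from earlier to later nodes in `ordering`."""
    position = {v: i for i, v in enumerate(ordering)}
    graph, total = {}, 0.0
    for v in ordering:
        for score, parents in cache[v]:           # entries sorted best first
            if all(position[p] < position[v] for p in parents):
                graph[v] = parents
                total += score
                break
        else:
            raise ValueError(f"no cached parent set of {v!r} fits the ordering")
    return graph, total
```

Since every chosen arc points forward in the ordering, the result is acyclic by construction, and the cost is a single pass over the cache.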

5.  Experiments 

We perform experiments to show the benefits of the
reduced cache and search space and the gains of
constraints (see footnote 2). First, we use data sets available
at the UCI repository (Asuncion & Newman, 2007). Lines with
missing data are removed and continuous variables are
discretized over the mean into binary variables. The
data sets are: adult (15 variables and 30162 instances),
car (7 variables and 1728 instances), letter (17 variables
and 20000 instances), lung (57 variables and 27
instances), mushroom (23 variables and 1868 instances),
nursery (9 variables and 12960 instances), Wisconsin
Diagnostic Breast Cancer or wdbc (31 variables and
569 instances), and zoo (17 variables and 101 instances).
No constraints are employed in this phase as we intend
to show the benefits of the properties discussed earlier.

2 The software is available online through the web
address http://www.ecse.rpi.edu/~cvrl/structlearning.html

Table 1 presents the cache construction results,
applying Theorem 1 and Lemmas 1 and 2. Its columns
show the data set name, the number of steps the
procedure spends to build the cache (a step equals a call
to the score function for a single variable and a
parent configuration), the time in seconds, the size of the
generated cache (number of scores stored; the memory
consumption is actually $O(n)$ times that number),
and finally the size of the cache if all scores were
computed. Note that the reduction is huge. Although
we discuss three distinct algorithms next, these results
also imply a performance gain for other algorithms in
the literature that learn BN structures. It is also possible
to analyze the search space reduction implied by these
results by looking at columns 2 and 3 of Table 2.

Table 1. Cache sizes (number of stored scores) and time (in
seconds) to build them for many networks and data sizes.
Steps represent the number of local (single node given a
parent set) score evaluations.

name       steps     time(s)   size    n2^n
adult      30058     182.09    672     2^17.9
car        335       0.09      24      2^8.8
letter     534230    2321.46   41562   2^20.1
lung       43592     1.33      3753    2^61.8
mushroom   140694    72.13     8217    2^26.5
nursery    1905      3.94      49      2^11.2
wdbc       1692158   351.04    7482    2^35
zoo        9118      0.31      1875    2^20.1
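
As a worked check of the last column of Table 1, the following sketch computes the base-2 logarithm of the number of (node, parent set) pairs when each of the n variables may take any subset of the other n − 1 variables as parents, that is, log2(n · 2^(n−1)). The helper name is an assumption for illustration; under that counting assumption, n = 15 (adult) gives about 17.9 and n = 57 (lung) about 61.8.

```python
import math

def log2_full_cache(n):
    """log2 of n * 2**(n - 1): one score per node and per subset of the other nodes."""
    return math.log2(n) + (n - 1)
```

Comparing these exponents with the stored cache sizes (for example, 672 scores for adult) illustrates how much of the full cache the pruning properties remove.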

In Table 2, we show results of three distinct algorithms:
the B&B described in Section 4, the dynamic programming
(DP) idea of (Silander & Myllymaki, 2006), and
an algorithm that picks variable orderings randomly
and then finds the best structure such that all arcs link
a node towards another that is not precedent in the
ordering. This last algorithm (named OS) is similar to
the K2 algorithm with random orderings, but it is always
better because a global optimum is found for each
ordering (see footnote 3). The scores obtained by each
algorithm (in percentage against the value obtained by B&B) and

3 We have run a hill-climbing approach (which also
benefits from the ideas presented in this paper), but its accuracy
was worse than OS. We omit it because of lack of space.
Table 2. Comparison of MDL scores among B&B, dynamic programming (DP), and ordering sampling (one thousand
times). Fail means that it could not solve the problem within 10 million steps. DP and OS scores are in percentage w.r.t.
the score of B&B (positive percentage means worse than B&B and negative percentage means better).

network    search   reduced   B&B                           DP                OS
           space    space     score       gap     time(s)   score   time(s)   score   time(s)
adult      2^210    2^71      -286902.8   5.5%    150.3     0.0%    0.77      0.1%    0.17
car        2^42     2^10      -13100.5    0.0%    0.01      0.0%    0.01      0.0%    0.01
letter     2^272    2^188     -173716.2   8.1%    574.1     -0.6%   22.8      1.0%    0.75
lung       2^3192   2^330     -1146.9     2.5%    907.1     Fail    Fail      1.0%    0.13
mushroom   2^506    2^180     -12834.9    15.3%   239.8     Fail    Fail      1.0%    0.12
nursery    2^72     2^17      -126283.2   0.0%    0.04      0.0%    0.04      0.0%    0.04
wdbc       2^930    2^216     -3053.1     13.6%   333.5     Fail    Fail      0.8%    0.13
zoo        2^272    2^111     -773.4      0.0%    5.2       0.0%    3.5       1.0%    0.03

the corresponding time spent are presented (excluding
the cache construction). A limit of ten million steps is
given to each method (steps here are counted as the
number of queries to the cache). The reduced space
where B&B performs its search is also presented, as
well as the maximum gap of the solution. This gap
is obtained from the relaxed version of the problem, so
we can guarantee that the global optimal solution is
within this gap (even though the solution found by the
B&B may already be the best, as shown in the first
line of the table). With the reduced cache presented
here, finding the best structure for a given ordering is
very fast, so it is possible to run OS over millions of
orderings in a short period of time.

Table 3. B&B procedure learning TANs. Time (in seconds)
to find the global optimum, cache size (number of stored
scores) and (reduced) space for B&B search.

network    time(s)   cache size   space
adult      0.26      114          2^39
car        0.01      14           2^6.2
letter     0.32      233          2^61
lung       0.26      136          2^51
mushroom   0.71      398          2^88
nursery    0.06      26           2^12
wdbc       361.64    361          2^?

Some additional comments are worth making. DP can solve
the mushroom set in less than 10 minutes if we drop
the limit of steps. The expected time for wdbc is around
four days. Hence, we cannot expect to obtain an
answer in larger cases, such as lung. It is clear that, in
the worst case, the number of steps of DP is smaller than
that of B&B. Nevertheless, B&B eventually bounds
some regions without processing them, provides an
upper bound at each iteration, and does not suffer from
memory exhaustion as DP does. This makes the method
applicable even to very large settings. Still, DP seems a
good choice for small n. When n is large (more than
35), DP will not finish in reasonable time, and hence
will not provide any solution, while B&B still gives an
approximation and a bound to the global optimum.
As for OS, if we sample one million times instead of
one thousand as done before, its results improve and
the global optimum is found also for the adult and
mushroom sets. Still, OS provides no guarantee or
estimate of how far the global optimum is (here we
know it has achieved the optimum because of the
exact methods). It is worth noting that both DP and
OS benefit from the smaller cache.

Table 3 shows the results when we employ constraints
to force the final network to be a Tree-augmented
Naive Bayes (zoo was run, but it is not included
because the unconstrained learned network was already
a TAN). Here the class is isolated in the data set and
constraints are included as described in Section 3.2.
Note that the cache size, the search space and
consequently the time to solve the problems have all
decreased. Finally, Table 4 has results for random data
sets with a predefined number of nodes and instances. A
randomly created BN with at most 3n arcs is used to
sample the data. Because of that, we are able to generate
random parameter and structural constraints that
are certainly valid for this true BN (approximately n/2
constraints for each case). The table contains the
total time to run the problem and the size of the cache,
together with the percentage of gain when using
constraints. Note that the code was run in parallel with
a number of tasks equal to n; otherwise an increase
by a factor of n must be applied to the results in the
table. We can see that the gain is recurrent in all cases
(the constrained version also has a smaller gap in all cases,
although that number is not shown).

6.  Conclusions 

This  paper  describes  a  novel  algorithm  for  learning 
BN  structure  from  data  and  expert’s  knowledge.  It 
integrates  structural  and  parameter  constraints  with 
Table 4. Results on random data sets generated from
random networks. Time to solve (10 million steps) and size of
the cache are presented for the normal unconstrained case
and the percentage of gain when using constraints.

nodes (n)/   unconstrained               constrained gain
instances    gap     time(s)   cache     time     cache
30/100       0%      0.06      125       67%      11.6%
30/500       0%      2.7       143       47.5%    26.5%
50/100       0%      0.26      310       31.4%    16.1%
50/500       0%      20.66     231       57.2%    29.8%
70/100       0%      4.58      1205      36.9%    18.8%
70/500       1.1%    356.9     666       38.4%    21.9%
100/100      0.5%    9.05      2201      47.5%    23.5%
100/500      1.4%    1370.4    726       50.2%    33.0%

data through a B&B procedure that guarantees global
optimality with respect to a decomposable score function.
It is an any-time procedure in the sense that, if stopped
early, it provides the best solution found so far
and a maximum error of such solution. The software
is available as described in the experiments.

We  also  describe  properties  of  the  structure  learning 
problem  based  on  scoring  DAGs  that  enable  the  B&B 
procedure  presented  here  as  well  as  other  methods  to 
work  over  a  reduced  search  space  and  memory.  Such 
properties  allow  the  construction  of  a  cache  with  all 
possible  local  scores  of  nodes  and  their  parents  without 
large  memory  consumption. 

Because  of  the  properties  and  the  characteristics  of 
the  B&B  method,  even  without  constraints  the  B&B 
is  more  efficient  than  state-of-the-art  exact  methods 
for  large  domains.  We  show  through  experiments  with 
randomly  generated  data  and  public  data  sets  that 
problems  with  up  to  70  nodes  can  be  exactly  processed 
in  reasonable  time,  and  problems  with  100  nodes  are 
handled  within  a  small  worst  case  error.  These  results 
surpass  by  far  current  methods,  and  may  also  help 
to  improve  other  approximate  methods  and  may  have 
interesting  practical  applications,  which  we  will  pursue 
in  future  work. 
1  Introduction

An influence diagram is a graphical model for decision making under uncertainty
[7]. It is composed of a directed graph where utility nodes are associated with
profits and costs of actions, chance nodes represent uncertainties and
dependencies in the domain, and decision nodes represent actions to be taken. Given an
influence  diagram,  a  strategy  defines  which  decision  to  take  at  each  node,  given 
the  information  available  at  that  moment.  Each  strategy  has  a  corresponding 
expected  utility.  One  important  problem  in  influence  diagrams  is  learning  the 
model,  which  includes  the  elicitation  of  probability  distributions  for  the  chance 
nodes  and  utility  functions  for  the  utility  nodes.  The  direct  elicitation  of  such 
values  using  expert  knowledge  may  be  replaced  by  an  automatic  learning  pro¬ 
cedure  when  a  data  set  of  past  events  is  available. 

In  this  paper,  we  propose  new  ideas  to  learn  the  parameters  of  a  Limited 
Memory  Influence  Diagram,  or  simply  LIMID.  LIMIDs  represent  a  very  general 
class  of  influence  diagrams,  and  include  the  most  traditional  case  (introduced 
initially by [7]) as a subcase [8]. Limited Memory means that the assumption of
no-forgetting  usually  employed  in  standard  Influence  Diagrams  (that  is,  values 
of  observed  variables  and  decisions  that  have  been  taken  are  remembered  at  all 
later  times)  is  relaxed.  Because  LIMIDs  are  general  and  do  not  have  assump¬ 
tions  about  no-forgetting  and  ordering  for  decisions,  it  is  possible  to  efficiently 
convert  diagrams  that  have  such  assumptions  into  LIMIDs.  Hence,  LIMIDs  are 
more  powerful  (in  the  sense  of  expressiveness)  than  other  influence  diagrams. 
The benefits of having a more expressive model come with additional
computational cost. Thus, specialized algorithms that exploit the characteristics
of LIMIDs are needed.

We  describe  a  learning  procedure  for  LIMIDs  that  is  motivated  by  previous 
work  on  the  traditional  influence  diagram  [10,  9,  1].  We  assume  that  the  decision 
maker  takes  rational  decisions,  which  leads  to  the  maximization  of  the  expected 
utility.  This  model  requires  two  types  of  information:  the  probabilities  and  the 
utilities  of  all  possible  outcomes  of  the  decision  problem.  Probability  values  are 
obtained  by  standard  learning  procedures,  usually  borrowed  from  the  theory  of 
Bayesian  networks. 

The estimation of the utility function can be addressed by elicitation from
experts, who need to answer a sequence of questions about their preferences,
or by automatic procedures that receive as input the past behavior of experts
through a data set of cases. Dealing with the experts may be difficult and
generate errors, as previously reported [6]. On the other hand, the data set of past
cases can be used to create a set of constraints on the space of possible utility
functions, from which a conservative (e.g. by using a maximin approach)
or a more aggressive (e.g. by taking any admissible function inside the set)
utility function is later drawn. In the literature, we find heuristics [9] and
probabilistic approaches [1]. Here we assume that a database of past events, decisions
and rewards is available to train the influence diagram, which is a reasonable
assumption in military problems, and we use this database to learn both the
probability values and the utility functions. The behavior of the experts,
encoded in the database, is employed to create constraints on the utility function
of the problem. Such constraints are integrated in the strategy selection process
using the ideas we have developed [5].

2  LIMIDs:  Limited  Memory  Influence  Diagrams 

A Limited Memory Influence Diagram $\mathcal{L}$ is composed of a directed acyclic graph
$(V, E)$ where nodes are partitioned into three types: chance, decision and utility
nodes. Let $\mathcal{C}$, $\mathcal{D}$ and $\mathcal{U}$ be the sets of chance, decision and utility nodes,
respectively, and let $\mathcal{X} = \mathcal{C} \cup \mathcal{D}$. Links of $E$ characterize dependencies among
nodes. Explicitly, links toward a chance node indicate probabilistic dependence
of the node on its parents; links toward a decision node indicate which
information is available to take such decision; and links toward utility nodes represent
that a utility for those parents is to be considered (utility nodes may not have
children). Associated with each node, there are some parameters:

1.  A chance node has an associated categorical random variable $C$ with finite
domain and conditional probability distributions $p(C \mid \pi_j(C))$, for each
configuration $\pi_j(C)$ of its parents $\pi(C)$ in the graph. Here $j$ is used to indicate
a configuration of the parents of $C$, that is, $\pi_j(C) \in \Omega_{\pi(C)}$, where we use the
notation $\Omega_{V'} = \times_{Y \in V'} \Omega_Y$, for any $V' \subseteq V$.

2.  A decision node $D$ is associated with a finite set of mutually exclusive
alternatives $\Omega_D$. Parents of $D$ describe the information that is available at the
moment at which decision $D$ has to be taken.

3.  A utility node $U$ is associated with a rational function $f_U : \Omega_{\pi(U)} \to \mathbb{Q}$. The
value corresponding to a parent configuration is the profit (cost is viewed
as negative profit) of such parent configuration. Utility nodes have no
children.

A simple example is depicted in Figure 1. Decision nodes are represented by
rectangles, chance nodes by ellipses and utility nodes by diamonds. do-ground-attack
has an associated cost, which is depicted by the corresponding utility node. The
same is modeled for bomb-bridge. The goal is to achieve territory-occupation,
which also has a utility (the profit of the goal). ground-attack and bridge-condition
represent the uncertain outcomes of the corresponding actions. To simplify
notation, we denote the nodes by their initial letters as follows: do-ground-attack is
denoted by DGA, bomb-bridge by BB, territory-occupation by TO, ground-attack
by GA, and so on.

Figure 1: Simple Influence Diagram example.

Note that no known ordering (among decision nodes) in which decisions must be
taken is assumed to exist, as is done in simpler versions of influence
diagrams. Although decision nodes have no parents in the example of Figure 1,
this is not a restriction of the model.

A policy $\delta_D$ for the decision node $D$ is a function $\delta_D : \Omega_{D \cup \pi(D)} \to [0, 1]$
defined for each alternative of $D$ and each configuration of $\pi(D)$ such that, for
each $\pi_j(D) \in \Omega_{\pi(D)}$, we have $\sum_{d \in \Omega_D} \delta_D(d, \pi_j(D)) = 1$. A pure policy is a policy
such that its image is integer ($\delta_D : \Omega_{D \cup \pi(D)} \to \{0, 1\}$), and thus specifies with
certainty which action (alternative of $D$) is taken for each parent configuration
(in a pure policy, only one $\delta_D(d, \pi_j(D))$ for each $\pi_j(D)$ will be non-zero, as they
sum to 1). A strategy $\Delta$ is a set of policies $\{\delta_D : D \in \mathcal{D}\}$, one for each decision
node of the diagram. A pure strategy is composed only of pure policies.

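The expected utility $EU(\Delta)$ of a strategy $\Delta$ is evaluated through the
following equation:

\[
EU(\Delta) = \sum_{x \in \Omega_{\mathcal{X}}} \left( \prod_{C \in \mathcal{C}} p(x_C \mid \pi_j(C)) \prod_{D \in \mathcal{D}} \delta_D(x_D) \sum_{U \in \mathcal{U}} f_U(\pi_{j'}(U)) \right),
\tag{1}
\]

where $x_C$, $\pi_j(C)$, $x_D$ and $\pi_{j'}(U)$ are respectively the projections of $x$ in $\Omega_C$,
$\Omega_{\pi(C)}$, $\Omega_{D \cup \pi(D)}$ and $\Omega_{\pi(U)}$. This equation means that, given a strategy, its
expected utility is the sum of the utility values weighted by the probability of
each diagram configuration (for all configurations). The maximum expected
utility is obtained over all possible strategies:

\[
MEU = \max_{\Delta} EU(\Delta).
\]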
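
To make the expected-utility sum concrete, here is a minimal brute-force sketch that enumerates every joint configuration of chance and decision nodes. It is only an illustration under stated assumptions: the toy diagram (one decision D, one chance node C with parent D, and two utility nodes), all numbers, and the names `expected_utility` and `pure_policy` are hypothetical and do not come from Figure 1.

```python
from itertools import product

def expected_utility(domains, chance, policies, utilities):
    """Sum over all configurations of probability-weighted total utility.

    domains:   node -> list of values (chance and decision nodes)
    chance:    node -> (parents, p) with p(value, parent_values) a probability
    policies:  node -> (parents, delta) with delta(value, parent_values) in [0, 1]
    utilities: list of (parents, f) with f(parent_values) a utility value
    """
    names = list(domains)
    factors = list(chance.items()) + list(policies.items())
    total = 0.0
    for values in product(*(domains[n] for n in names)):
        x = dict(zip(names, values))
        weight = 1.0
        for node, (parents, fn) in factors:
            weight *= fn(x[node], tuple(x[p] for p in parents))
        utility = sum(f(tuple(x[p] for p in parents)) for parents, f in utilities)
        total += weight * utility
    return total

# Hypothetical toy diagram: acting (D = 1) costs 10 and raises the chance of success.
domains = {"D": [0, 1], "C": [0, 1]}
chance = {"C": (("D",), lambda c, pa: (0.8 if c == 1 else 0.2) if pa[0] == 1
                else (0.1 if c == 1 else 0.9))}
utilities = [(("D",), lambda pa: -10.0 if pa[0] == 1 else 0.0),
             (("C",), lambda pa: 100.0 if pa[0] == 1 else 0.0)]

def pure_policy(choice):
    """Pure policy for a parentless decision node: always pick `choice`."""
    return ((), lambda d, pa: 1.0 if d == choice else 0.0)

eu = {d: expected_utility(domains, chance, {"D": pure_policy(d)}, utilities)
      for d in domains["D"]}
meu = max(eu.values())
```

In this toy case the two pure strategies can be compared directly; enumeration is exponential in the number of nodes, so it is useful only as a reference for very small diagrams.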
Chances           Decisions
TO   GA   BC      DGA   BB
1    1    1       1     0
1    1    1       0     1
0    1    1       0     1
0    0    1       1     1

Table 1: Example of database for LIMID of Figure 1 where outcomes of the
chance and decision nodes are available.


Optimal strategy    Utilities
DGA   BB            COA    COB    POG
1     1             -100   -50    200
?     1             0      -60    180
1     0             -80    0      -30
0     0             0      0      -50

Table 2: Example of database for LIMID of Figure 1 where local utility rewards
are available when the optimal strategy is used.


The problem of strategy selection is to obtain the strategy that maximizes the
expected utility, that is, $\arg\max_{\Delta} EU(\Delta)$.

Algorithms to solve the strategy selection problem assume that all parameters
of chance nodes and utility nodes are known and fixed. However, eliciting
all such parameters from experts is a hard task, and can usually be done only in
small LIMIDs. The next section addresses this problem, providing an approach
to learn parameters from past experience.


3  Learning  LIMIDs 

As discussed, we use a database containing past experience to learn the
parameters of the LIMID. The data are divided into two parts: $DB_{\mathcal{C}}$ is a set of samples
of the chance and decision nodes (note that in these data, the decisions do not
necessarily represent optimal decisions); $DB_{\mathcal{D}}$ is a set of samples containing an
optimal strategy (which is defined by a set of optimal decisions) and the utility
values obtained at each utility node when the optimal strategy is employed.
Using the LIMID of Figure 1, examples of data that would be available to learn its
parameters are presented in Tables 1 and 2 (the nodes of Figure 1 are denoted
by their initial letters, as explained before). True is denoted by a one, and false
is indicated by a zero.

Using the database of chance and decision variables, standard techniques
can be employed to find the most probable values for the parameters of the
chance nodes, denoted $P = \{p(C \mid \pi_j(C))\}_{\forall C}$. One way to
quantify the result is by the log likelihood function $\log p(DB_C \mid P)$.
Because utility nodes have no children, chance nodes can only have other chance
nodes and/or decision nodes as parents. Assuming that samples are drawn
independently from the underlying distribution, we can use the decomposition
property of the log likelihood function to maximize
$\log \prod_{ijk} p(x_{ik} \mid \pi_{ij})^{n_{ijk}}$, where
$x_{ik} \in \Omega_{C_i}$ is a value of chance node $C_i$, $n_{ijk}$ indicates
how many elements of $DB_C$ contain both $x_{ik}$ and $\pi_{ij}$, and
$\pi_{ij} \in \Omega_{\pi(C_i)}$ is a joint configuration of the parents of
$C_i$, which can involve chance and decision nodes. Maximum likelihood
estimation has its optimum at
$p(x_{ik} \mid \pi_{ij}) = n_{ijk} / \sum_{k} n_{ijk}$.
One may also use a Bayesian Dirichlet model instead of maximum likelihood
estimation, just as is done in Bayesian networks.
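
The counting estimate above can be sketched in a few lines of Python. This is
a minimal illustration, not the implementation used in the project: the
function name `mle_cpt` and the four toy samples are assumptions, and so is
the parent set (BC, DGA) of GA, which is read off the factor p(GA | BC, DGA)
that appears in Section 3.2.

```python
from collections import Counter


def mle_cpt(samples, child, parents):
    """Maximum likelihood CPT: n_ijk / sum_k n_ijk for each parent configuration."""
    joint = Counter()      # n_ijk: counts of (parent configuration, child value)
    marginal = Counter()   # sum over k of n_ijk: counts of parent configuration
    for s in samples:
        pa = tuple(s[p] for p in parents)
        joint[(pa, s[child])] += 1
        marginal[pa] += 1
    return {key: n / marginal[key[0]] for key, n in joint.items()}


# Toy samples over chance nodes (TO, GA, BC) and decision nodes (DGA, BB).
samples = [
    {"TO": 1, "GA": 1, "BC": 1, "DGA": 1, "BB": 0},
    {"TO": 1, "GA": 1, "BC": 1, "DGA": 0, "BB": 1},
    {"TO": 0, "GA": 1, "BC": 1, "DGA": 0, "BB": 1},
    {"TO": 0, "GA": 0, "BC": 1, "DGA": 1, "BB": 1},
]

# Assumed parent set of GA: (BC, DGA).
cpt = mle_cpt(samples, child="GA", parents=("BC", "DGA"))
for (pa, value), prob in sorted(cpt.items()):
    print(f"p(GA={value} | BC,DGA={pa}) = {prob:.2f}")
```

Parent configurations that never occur in the data get no entry at all, which
is one practical reason to prefer the Bayesian Dirichlet alternative mentioned
above when data are sparse.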

3.1  Learning  utility  functions 

Our goal here is to estimate the utility function
$f_U : \Omega_{\pi(U)} \to \mathbb{Q}$ for each utility node $U$, based on past
optimal decisions. Note that our database $DB_D$ contains only optimal
strategies and their utilities at that moment. If we assume that $DB_D$
contains a wide range (in the sense of covering all the utility space) of
decisions and corresponding utility values, an estimation based on expectation
suffices. For instance, we take the average of the utility values that appear in
$DB_D$, for each configuration $\pi_j(U) \in \Omega_{\pi(U)}$, and construct a
set of constraints that relate the utility values for distinct elements
$\pi_j(U)$ as follows:

•  For each sample $l = (l_D, l_U)$ in $DB_D$, let $l_U$ (for each
$U \in \mathcal{U}$) be the observed utility value of node $U$ and $l_D$ the
decision values of sample $l$. Build the constraint:

$$\forall U: \quad E[l_U \mid l_D] = \sum_{x \in \Omega_{\pi(U) \setminus D}} f_U(x \cup l_D) \cdot p(x \mid l_D), \qquad (2)$$

where $p(x \mid l_D)$ is calculated a priori given that all elements of $P$ are
already known, and $E[l_U \mid l_D]$ accounts for the expectation of $l_U$ over
the samples $l$ that are compatible with $l_D$.

This way we have a set of linear constraints to define the utility function of
each node $U$. Note that if a utility node has only decision nodes as parents
(no chance node as parent), then Equation (2) simplifies to an equality without
summation. For example, if we take the first line of Table 2, it implies the
following constraints:

$$-90 = E[l_{COA} \mid DGA=1] = f_{COA}(DGA=1),$$

$$-55 = E[l_{COB} \mid BB=1] = f_{COB}(BB=1). \qquad (3)$$
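
The two values in Equation (3) are plain averages over the samples of Table 2
that are compatible with the given decision. A minimal sketch follows; the
helper name `expected_utility` and the dictionary encoding of the rows are
assumptions made for illustration, and the missing entry "?" is encoded as
`None`, so that row is simply not counted as compatible with DGA=1.

```python
def expected_utility(rows, decision, value, utility):
    """Average of `utility` over rows whose observed `decision` equals `value`."""
    vals = [r[utility] for r in rows if r[decision] == value]
    return sum(vals) / len(vals)


# Rows of Table 2 (None marks the unobserved decision).
rows = [
    {"DGA": 1,    "BB": 1, "COA": -100, "COB": -50, "POG": 200},
    {"DGA": None, "BB": 1, "COA": 0,    "COB": -60, "POG": 180},
    {"DGA": 1,    "BB": 0, "COA": -80,  "COB": 0,   "POG": -30},
    {"DGA": 0,    "BB": 0, "COA": 0,    "COB": 0,   "POG": -50},
]

f_coa = expected_utility(rows, "DGA", 1, "COA")  # average of -100 and -80
f_cob = expected_utility(rows, "BB", 1, "COB")   # average of -50 and -60
print(f_coa, f_cob)
```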

The situation becomes more difficult when the optimal strategy is only
partially observed, that is, missing values may appear in the optimal decisions
of a sample in $DB_D$. In this case, we proceed in a similar way, but
considering all possible completions of the data [11]. Such an approach leads
to the following constraints:

•  For each sample $l = (l_D, l_U)$ in $DB_D$, let $l_D = (l_D^o, l_D')$ with
$l_D^o$ the observed part and $l_D'$ the missing part. Build the constraint:

$$\forall U, l_D: \quad \underline{E}[l_U \mid l_D] \le \sum_{l_D'} \Big( \sum_{x \in \Omega_{\pi(U) \setminus D}} I_l(l_D') \cdot f_U(x \cup l_D^o \cup l_D') \cdot p(x \mid l_D^o \cup l_D') \Big) \le \overline{E}[l_U \mid l_D], \qquad (4)$$

where $I_l(\cdot)$ is an indicator function that is one only for the completion
$l_D'$ that is assigned to sample $l$, and $\underline{E}[l_U \mid l_D]$,
$\overline{E}[l_U \mid l_D]$ are computed as the minimum and maximum values of
the expectation of $l_U$ considering all possible completions. We treat
$I_l(\cdot)$ as additional boolean variables of the problem, which are going to
be assigned later by the strategy selection procedure.
Note that $\sum_{l_D'} I_l(l_D') = 1$.

It is possible to use such an idea with additional boolean variables because we
resort to the procedures we have developed for strategy selection [5], where the
problem is tackled by integer programming techniques (and thus the boolean
variables are trivially included in the optimization problem). As an example,
take the second line of Table 2 with respect to the node DGA. As it is missing,
we have the following constraints:

$$-90 \le \sum_{v \in \{0,1\}} I_{l,DGA}(v) \cdot f_{COA}(DGA=v) \le -60,$$

$$0 \le \sum_{v \in \{0,1\}} I_{l,DGA}(v) \cdot f_{COA}(DGA=v) \le 0, \qquad (5)$$

where $I_{l,DGA}$ is treated as a boolean variable, that is, depending on the
value assigned to it, the data are completed in a different way. This completion
is automatically conducted by the optimization of Equation (1) subject to
Equations (2) and (4), so the same algorithm we have developed before [5] is
used to select the best strategy over this now learned LIMID.
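
The bounds that appear in Equation (5) can be reproduced by enumerating the
completions of the missing decision in Table 2 and recomputing the average
utility each time. The sketch below is only an illustration under that
reading; the helper name `expectation_bounds` and the row encoding (with
`None` for the missing value) are assumptions.

```python
from itertools import product


def expectation_bounds(rows, decision, value, utility, domain=(0, 1)):
    """Min and max of E[utility | decision=value] over all completions of None entries."""
    missing = [i for i, r in enumerate(rows) if r[decision] is None]
    results = []
    for completion in product(domain, repeat=len(missing)):
        filled = [dict(r) for r in rows]
        for i, v in zip(missing, completion):
            filled[i][decision] = v
        vals = [r[utility] for r in filled if r[decision] == value]
        if vals:  # skip completions with no compatible sample
            results.append(sum(vals) / len(vals))
    return min(results), max(results)


# Rows of Table 2 (None marks the unobserved decision).
rows = [
    {"DGA": 1,    "BB": 1, "COA": -100, "COB": -50, "POG": 200},
    {"DGA": None, "BB": 1, "COA": 0,    "COB": -60, "POG": 180},
    {"DGA": 1,    "BB": 0, "COA": -80,  "COB": 0,   "POG": -30},
    {"DGA": 0,    "BB": 0, "COA": 0,    "COB": 0,   "POG": -50},
]

bounds_dga1 = expectation_bounds(rows, "DGA", 1, "COA")
bounds_dga0 = expectation_bounds(rows, "DGA", 0, "COA")
print(bounds_dga1, bounds_dga0)
```

Completing the missing entry with DGA=1 gives the average of -100, 0 and -80,
while completing it with DGA=0 leaves the average of -100 and -80, which yields
the interval [-90, -60]; for DGA=0 both completions give an average of 0.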

3.2  The  optimization  problem  of  the  EBO  example 

To illustrate, we write down the optimization problem of the example in Figure
1. The utility functions are divided by the largest value in the database so they
certainly belong to the interval [0, 1]. This division does not affect the
choice of the optimal strategy [5]. Decision nodes are replaced by nodes with
imprecise probabilities, and we obtain a credal network where we maximize the
sum of the marginal probabilities of the utility nodes. The objective function
is

$$\max \; p(COA) + p(COB) + p(POG)$$

(here we use the notation $p(\cdot)$ instead of $f(\cdot)$ because we are
treating the utility nodes after translating them into probability nodes)
subject to constraints that define each marginal probability $p(COA)$, $p(COB)$
and $p(POG)$. To create these constraints, we run a symbolic Bayesian network
inference for each of them. The constraints for $p(COA)$ and $p(COB)$ are very
simple:

$$p(COA) = p(COA \mid DGA=1)\,p(DGA=1) + p(COA \mid DGA=0)\,p(DGA=0),$$

$$p(COB) = p(COB \mid BB=1)\,p(BB=1) + p(COB \mid BB=0)\,p(BB=0),$$

because they only depend on one other variable. Note that $p(DGA=1)$,
$p(DGA=0)$, $p(BB=1)$, and $p(BB=0)$ that appear in these constraints are
unknown and thus become optimization variables in the bilinear problem.

To write the constraints for $p(POG)$, we need to choose a precedence
ordering. We will use the ordering BB, BC, DGA, GA, TO, POG (variables COA
and COB do not appear in the order as they are not relevant to evaluate the
marginal $p(POG)$). Hence, the first variable to be processed is BB. We write
a constraint that relates the query POG and probabilities $p(BB)$ (which are
defined in the network specification):

$$p(POG) = \sum_{d \in \{0,1\}} p(BB=d) \cdot p(POG \mid BB=d).$$

BB now appears in the conditional part of $p(POG \mid d)$, which may be viewed
as an artificial term in the optimization, as it does not appear in the network.
Because of that, we must create constraints to define $p(POG \mid d)$ in terms
of network parameters (for all categories $d$ of BB). According to our chosen
ordering, the current variable to be processed is BC. Thus,

$$p(POG \mid BB=1) = \sum_{c \in \{0,1\}} p(BC=c \mid BB=1) \cdot p(POG \mid BC=c),$$

$$p(POG \mid BB=0) = \sum_{c \in \{0,1\}} p(BC=c \mid BB=0) \cdot p(POG \mid BC=c).$$

Note that $p(POG \mid c) = p(POG \mid c, d)$ (for any $d$), so we use the
simpler form. At this stage, our query is conditioned on BC. Following the same
idea, we process DGA, obtaining

$$p(POG \mid BC=1) = \sum_{d \in \{0,1\}} p(DGA=d) \cdot p(POG \mid BC=1, DGA=d),$$

$$p(POG \mid BC=0) = \sum_{d \in \{0,1\}} p(DGA=d) \cdot p(POG \mid BC=0, DGA=d).$$

Now the current variable to be treated is GA, and our query is conditioned on
BC, DGA, that is, we must define how to evaluate $p(POG \mid BC, DGA)$ for all
configurations. Thus, for all $c \in \{0,1\}$ and $d \in \{0,1\}$:

$$p(POG \mid BC=c, DGA=d) = \sum_{c' \in \{0,1\}} p(GA=c' \mid BC=c, DGA=d) \cdot p(POG \mid BC=c, GA=c').$$

At this moment, POG is conditioned on GA, BC in the artificial term
$p(POG \mid BC=c, GA=d)$ (DGA is not present in the artificial term as GA, BC
separate POG from DGA). Now we process TO: for all $d \in \{0,1\}$ and
$c \in \{0,1\}$

$$p(POG \mid BC=c, GA=d) = \sum_{c'' \in \{0,1\}} p(TO=c'' \mid BC=c, GA=d) \cdot p(POG \mid TO=c'').$$

Note that, as $p(POG \mid TO=c'')$ equals the utility function of POG given TO,
which is specified in the network, we can stop the symbolic elimination. All
artificial terms are related (through constraints) to parameters of the network.
Besides all these constraints, we also include simplex constraints to ensure
that probabilities sum to 1. Finally, we need to include the constraints of
Equations (3) and (5) (and other utility constraints that we omitted for ease
of exposition, as explained in Section 3.1). To illustrate using the same
notation, we rewrite Equations (5) here:

$$-90 \le I_{l,DGA}(1) \cdot r \cdot p(COA \mid DGA=1) + I_{l,DGA}(0) \cdot r \cdot p(COA \mid DGA=0) \le -60,$$

$$0 \le I_{l,DGA}(1) \cdot r \cdot p(COA \mid DGA=1) + I_{l,DGA}(0) \cdot r \cdot p(COA \mid DGA=0) \le 0,$$

and

$$I_{l,DGA}(1) + I_{l,DGA}(0) = 1,$$

where $r$ is the maximum utility value that appears in the whole database.

Because we have a collection of linear and bilinear constraints, non-linear
programming can be employed [3]. It is also possible to use linear integer
programming [4] after an additional manipulation of the constraints.
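
The chain of constraints derived for p(POG) in this section can be checked
numerically: evaluating the constraints from the innermost one outwards must
give the same marginal as a brute-force sum over the joint distribution. The
sketch below does this under stated assumptions: the dependency structure
(BB to BC, BC and DGA to GA, BC and GA to TO, TO to POG) is read off the
conditional terms used in the derivation, all probability tables are random
placeholders, and the decision nodes BB and DGA are given arbitrary fixed
policy probabilities.

```python
import itertools
import random

random.seed(0)
B = (0, 1)

# Hypothetical tables, each storing the probability that the variable equals 1.
p_bb = random.random()                                        # p(BB=1), a policy
p_dga = random.random()                                       # p(DGA=1), a policy
p_bc = {b: random.random() for b in B}                        # p(BC=1 | BB=b)
p_ga = {(c, d): random.random() for c in B for d in B}        # p(GA=1 | BC=c, DGA=d)
p_to = {(c, g): random.random() for c in B for g in B}        # p(TO=1 | BC=c, GA=g)
p_pog = {t: random.random() for t in B}                       # p(POG=1 | TO=t)


def bern(p1, x):
    """Probability of value x for a binary variable with p(X=1) = p1."""
    return p1 if x == 1 else 1.0 - p1


# Evaluate the constraints from the last one (TO) back to the first one (BB).
pog_bc_ga = {(c, g): sum(bern(p_to[c, g], t) * p_pog[t] for t in B)
             for c in B for g in B}
pog_bc_dga = {(c, d): sum(bern(p_ga[c, d], g) * pog_bc_ga[c, g] for g in B)
              for c in B for d in B}
pog_bc = {c: sum(bern(p_dga, d) * pog_bc_dga[c, d] for d in B) for c in B}
pog_bb = {b: sum(bern(p_bc[b], c) * pog_bc[c] for c in B) for b in B}
p_pog_chain = sum(bern(p_bb, b) * pog_bb[b] for b in B)

# Brute-force marginal over the full joint distribution.
p_pog_brute = 0.0
for b, c, d, g, t in itertools.product(B, repeat=5):
    p_pog_brute += (bern(p_bb, b) * bern(p_bc[b], c) * bern(p_dga, d)
                    * bern(p_ga[c, d], g) * bern(p_to[c, g], t) * p_pog[t])

print(p_pog_chain, p_pog_brute)
```

In the actual optimization the policy probabilities are variables rather than
fixed numbers, which is what makes the products in these constraints bilinear.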


4  Conclusion 

In this work we have presented a complete learning procedure for LIMIDs. To
deal with chance nodes, a standard technique from machine learning is
employed. To estimate utility functions, we make use of a database containing
past decisions and rewards, which avoids the need for interviews with experts
to elicit the utilities. With this approach, it is expected that the model
represents the corresponding domain with more precision. The procedure creates
a set of constraints that define the plausible utility functions, which can
be later processed by the strategy selection method. Hence, this work provides
a learning procedure to be integrated with the inference methods that we have
developed in previous years. Altogether, they provide a concise solution
for military planning using the general Limited Memory Influence Diagrams as
a basis.

Future work includes the full integration of the learning procedure described
here with the inference procedures already developed, the extension of these
ideas to structure discovery of utilities [2], and the exploration of new
learning and strategy methods. For instance, it is known that both learning and
inference can be performed very efficiently (in polynomial time) in tree-shaped
diagrams. We have already started to study a broader class of influence diagrams
that extends trees but is more restricted than LIMIDs. It is mainly a type
of LIMID where decision nodes are related by a tree structure, while chance
nodes are free to appear anywhere. The advantage of such an intermediate model
is the possibility of keeping the computations tractable (we have shown that
LIMIDs can be processed up to a couple of hundred nodes, but larger
domains are much more time consuming and need efficient algorithms).
Validation of Cassio's software


Geng  Li 

1.  Introduction 

In the first part of this report, I provide three test examples used for the
validation of Cassio's software. The first example is an invented real-life
scenario, the second example comes from the highly cited paper "Representing
and Solving Decision Problems with Limited Information", and the third example
is a small EBO net from the project.

In the second part of this report, I implemented a brute-force method based on
BNT to enumerate all possible strategies in Cassio's example and computed the
expected utility of each strategy. Then I compared the results to the outcomes
of SPU and CR. In all tests Cassio's method (CR) converged to the global
maximum solution, while SPU in some tests converged only to a local maximum.

2.  Validation  of  the  Implementation  of  the  software 

Cassio's software currently handles the LIMID version of the diagram, whereas
GeNIe (a software package developed by the University of Pittsburgh) handles
traditional IDs. Cassio's example cannot be validated in GeNIe because the
order of the decision nodes in the example is not specified. Additionally, to
test Cassio's example in GeNIe, we would have to explicitly add arcs from the
predecessors of each decision node, which creates too many links in large
networks and is not practical. So GeNIe, which handles only traditional ordered
IDs, cannot be used on Cassio's EBO example, which is a LIMID with no specified
order. However, we can test a LIMID that has a specified order and explicit
links: once the arcs are made explicit, such a diagram can be viewed either as
a traditional influence diagram or as a LIMID, and both views give the same
result. This is what I did in the previous report. The three examples are shown
in detail as follows:

(1)  Example  1 

EX1: John is close to graduating. He has two choices: one is to continue with
postdoc research and the other is job hunting. Before making his decision, he
wants to take the economic situation into account and may consult a specialist.
The consulting fee is about 10 dollars. If the economic situation is good, the
probability that the specialist makes a promising prediction is 0.9, and if it
is bad, the probability that the specialist makes a promising prediction is
0.1. The benefit of continuing as a postdoc is 1000 dollars no matter what the
current situation is. However, if the economic situation is good, the average
benefit of getting a job is 5000 dollars, and if the economic situation is bad,
the average benefit of getting a job is only 100 dollars.

The CPDs for the example are displayed in Figure 1, and the diagrams built with
Cassio's software and GeNIe are shown in Figure 2 and Figure 3, respectively:
Figure 2 The diagram created by Cassio's software


Figure  3  The  diagram  created  by  GeNIe 


(i) Software validation

The MEU values calculated by both software packages are shown in Figure 4. Both
obtain the same MEU, 1269 (in Cassio's software, the two methods SPU and CR
also give the same result). Figure 5 shows that the optimal strategies selected
by the two packages are the same as well. Note that in GeNIe, the bold-faced
numbers denote the optimal action to take given the configuration of the
parents. In Cassio's software, the green rectangle denotes the optimal policy.


Figure 4 The upper result is from GeNIe, and the lower result is from Cassio's software

Figure 5 The optimal strategies from the two software packages

(ii) Manual calculation

Now I calculate the above example manually to validate the software. The
expected utility of the whole diagram is calculated using:

$$EU(D_C) = \sum_{R} \max_{D_{P\_or\_J}} \sum_{ES} P(ES)\, P(R \mid ES, D_C)\, \{U_{cost}(D_C) + U_2(ES, D_{P\_or\_J})\}$$

The main idea of SPU is to keep all other policies unchanged and to update only
one policy at a time to obtain a new MEU. The original policy is then replaced
by the new optimal one.


We initially assume that the original strategy is to consult and to continue as
a postdoc. According to the SPU algorithm, for the decision node
"Postdoc_or_Job_hunting":

$$
\begin{aligned}
EU(D_{P\_or\_J} &= \text{Job} \mid D_C = \text{consult}, R = \text{promising}) \\
&= \sum_{ES} P(ES)\, P(R = \text{promising} \mid ES, D_C = \text{consult})\, \{U_{cost}(D_C = \text{consult}) + U_{Benefit}(ES, D_{P\_or\_J} = \text{Job})\} \\
&= 0.1 \cdot 0.9 \cdot \{-10 + 5000\} + 0.9 \cdot 0.1 \cdot \{-10 + 100\} \\
&= 457.2
\end{aligned}
$$

Similarly, we calculate the other expected utilities of the decision node
"Post_or_Job" for each configuration of its parents:

$$EU(D_{P\_or\_J} = \text{Job} \mid D_C = \text{consult}, R = \text{desperate}) = 0.1 \cdot 0.1 \cdot \{-10 + 5000\} + 0.9 \cdot 0.9 \cdot \{-10 + 100\} = 122.8$$

$$EU(D_{P\_or\_J} = \text{Job} \mid D_C = \text{consult}, R = \text{no\_idea}) = \text{impossible}$$

$$
\begin{aligned}
EU(D_{P\_or\_J} &= \text{Post} \mid D_C = \text{consult}, R = \text{promising}) \\
&= \sum_{ES} P(ES)\, P(R = \text{promising} \mid ES, D_C = \text{consult})\, \{U_{cost}(D_C = \text{consult}) + U_{Benefit}(ES, D_{P\_or\_J} = \text{Post})\} \\
&= 0.1 \cdot 0.9 \cdot \{-10 + 1000\} + 0.9 \cdot 0.1 \cdot \{-10 + 1000\} \\
&= 178.2
\end{aligned}
$$

$$EU(D_{P\_or\_J} = \text{Post} \mid D_C = \text{consult}, R = \text{desperate}) = 811.8$$

$$EU(D_{P\_or\_J} = \text{Post} \mid D_C = \text{consult}, R = \text{no\_idea}) = \text{impossible}$$


If $D_{P\_or\_J}$ = 'Post' is chosen no matter what the result of R is, the EU
of the strategy is 178.2 + 811.8 = 990.0, which is less than
457.2 + 811.8 = 1269.0. Thus, given the decision to consult at the decision
node "Consult", if the "Report" is promising, the policy for the node
"Post_or_Job" is to look for a job. Otherwise, the policy for the node
"Post_or_Job" remains unchanged. We update our policy and can now use this
result to calculate the expected utility of the decision node "Consult".

$$
\begin{aligned}
EU(D_C = \text{Consult}) &= \sum_{R} \max_{D_{P\_or\_J}} \sum_{ES} P(ES)\, P(R \mid ES, D_C)\, \{U_{cost}(D_C) + U_2(ES, D_{P\_or\_J})\} \\
&= 457.2 + 811.8 = 1269.0
\end{aligned}
$$

Similarly,

$$EU(D_C = \text{not\_Consult}) = 0.1 \cdot 1 \cdot (0 + 1000) + 0.9 \cdot 1 \cdot (0 + 1000) = 1000$$

Thus the optimal policy for $D_C$ remains unchanged, that is, to consult. In
the second round of the SPU calculation, the strategy does not change because
the EU no longer changes. Therefore,

$$MEU = \max_{D_C} EU(D_C) = 1269.0$$

From the above, the MEU calculated for the decision node "Consult", which is
also the MEU for the whole diagram, is 1269.0. Thus, we see that Cassio's
method got the correct result.
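
The manual calculation can also be cross-checked by enumerating every strategy
of Example 1. The following is a small sketch rather than the BNT or GeNIe
model: the prior P(ES = good) = 0.1 is taken from the factors used in the
calculation above, the consulting fee is taken as exactly 10, and the
assumption that the report is always "no_idea" when John does not consult is
inferred from the "impossible" entries. All function and variable names are
illustrative.

```python
from itertools import product

P_ES = {"good": 0.1, "bad": 0.9}
P_PROMISING = {"good": 0.9, "bad": 0.1}  # p(R=promising | ES, consult)
REPORTS = ("promising", "desperate", "no_idea")
ACTIONS = ("post", "job")


def p_report(r, es, consult):
    """p(R=r | ES=es, D_C): without consulting, the report is always no_idea."""
    if not consult:
        return 1.0 if r == "no_idea" else 0.0
    if r == "promising":
        return P_PROMISING[es]
    if r == "desperate":
        return 1.0 - P_PROMISING[es]
    return 0.0


def utility(es, consult, action):
    """Consulting cost plus the benefit of the chosen action."""
    cost = -10 if consult else 0
    if action == "post":
        return cost + 1000
    return cost + (5000 if es == "good" else 100)


def expected_utility(consult, policy):
    """Expected utility when `policy` maps each report value to an action."""
    return sum(P_ES[es] * p_report(r, es, consult) * utility(es, consult, policy[r])
               for es in P_ES for r in REPORTS)


strategies = [(consult, dict(zip(REPORTS, acts)))
              for consult in (True, False)
              for acts in product(ACTIONS, repeat=len(REPORTS))]
best_consult, best_policy = max(strategies, key=lambda s: expected_utility(*s))
meu = expected_utility(best_consult, best_policy)
print(best_consult, best_policy, round(meu, 1))
```

The best strategy found this way consults the specialist, looks for a job after
a promising report and continues as a postdoc after a desperate one, in
agreement with the SPU calculation.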


(2)  Example  2 


Ex 2. Another example comes from the paper "Representing and Solving Decision
Problems with Limited Information" by Steffen L. Lauritzen and Dennis Nilsson
(2001). Because the example was already described in my previous report, I will
only present the graph model and the calculated result here.

Figure 6 shows the graph created with Cassio's software. Here I replaced the
original code with the updated version from Cassio's latest email. The MEU
calculated by SPU and CR is shown in Figure 7, and we see that Cassio's method
gets the same result as the paper "Representing and Solving Decision Problems
with Limited Information", reproduced in Figure 8.
Figure 6 The Pigs model created using Cassio's software


Figure 7 The calculated MEU from Cassio's software


Table 2
Expected Utilities of Selected Strategies in Pigs

    Always   Uniform   Never   Direct   Best LIMID   Best ID
     586       644      669     718        727         729

Figure 8 The result from the cited paper


(3)  Example  3 


Ex3. In Figure 9, I manually created a medium-size network which is similar to
Cassio's EBO net example. I created a medium-size network because the free
version of Hugin limits the number of nodes. The two software packages obtained
the same MEU: Figure 11 shows that the MEU calculated by Cassio's software with
the SPU and CR methods is 790.89, and Figure 12 shows the same result from
Hugin, that is, an MEU of 790.89.


Figure 9 The diagram created by Hugin

Figure 10 The diagram created by Cassio's software

Figure 11 MEU calculated using Cassio's software (dialog text: "MEU value is 790.89")

Figure 12 MEU calculated using Hugin


3. Comparison of SPU and CR

In this part, I implemented a brute-force method which enumerates all possible
strategies and computes the EU of each of them.


(1) Original example


In Cassio's example in Figure 13, there are 11 decision nodes in total. Each
node has two choices, "take" and "do not take", so the number of all possible
strategies is 2^11 = 2048. My program traverses all these 2048 strategies and
selects the ten with the largest EU.
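
The enumeration itself is simple to sketch. The code below is not the BNT
program used for the report: `expected_utility` is a hypothetical stand-in with
made-up weights, whereas the real evaluation runs inference on the EBO diagram
for each strategy. Only the enumeration of the 2^11 strategies and the
selection of the ten best mirror the procedure described above.

```python
import heapq
from itertools import product

N_DECISIONS = 11  # each decision is 1 ("take") or 0 ("do not take")


def expected_utility(strategy):
    """Hypothetical stand-in for the inference step; weights are made up."""
    weights = range(-5, 6)
    return float(sum(w * a for w, a in zip(weights, strategy)))


# Enumerate all 2**11 strategies and keep the ten with the largest EU.
strategies = list(product((0, 1), repeat=N_DECISIONS))
top10 = heapq.nlargest(10, strategies, key=expected_utility)

print(len(strategies))
for rank, s in enumerate(top10, start=1):
    print(rank, s, expected_utility(s))
```

The number of strategies doubles with every additional binary decision node, so
this kind of exhaustive check is feasible only for small diagrams such as this
one.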


For the original example provided by Cassio, with its parameters already
specified, the ten largest EU values calculated by the brute-force method are
shown in Table 1. I also ran the software: the MEU values calculated by CR and
SPU are 156.4 and -55.28, respectively (see Figure 14 for SPU and Figure 15 for
CR). The table clearly shows that the result of the CR method is exactly the
first item of the table, so in this example the CR method converged to the
global maximum. The result of SPU matches the ninth item of the table, which
shows that SPU converged only to a local maximum.


NO.     1       2       3       4       5       6        7        8        9        10
MEU   156.4   68.97   55.26   12.89   -7.11   -20.59   -34.21   -51.26   -55.28   -61.03

Table 1 The largest ten EU in Cassio's example


Figure  14  SPU  result 


Figure  15  CR  result 


(2)  Change  the  utility  value  of  the  node  "CostDestroyC2" 

In the second test, I changed the utility value of the node "CostDestroyC2"
from [-20, 0] to [-2000, 0]. This is shown in Figure 16.


Figure  16  The  utility  value  of  the  node  "CostDestroyC2" 

Then I ran the brute-force method on this example; the ten largest EU results
are shown in Table 2. Note that different strategies can lead to the same EU.
The results of SPU and CR are shown in Figure 17. This time both obtained the
same result, an MEU of -232.11. Referring to Table 2, this is expected because
both of them converged to the global maximum.


NO.      1        2        3        4        5        6        7        8        9        10
MEU   -232.1   -252.1   -252.1   -252.1   -272.1   -272.1   -278.8   -292.1   -306.1   -312.3

Table 2 The largest ten EU


Figure 17 The result of SPU and CR

(3) Change the CPT of the chance node "Hypothesis"

In the third test, the CPTs of the chance node "Hypothesis" before and after
the change are shown in Figure 18 and Figure 19.


Figure 18 The CPT of "Hypothesis" before the change


Figure 19 The CPT of "Hypothesis" after the change


Using the brute-force algorithm, the ten largest EU values are shown in Table
3. The MEU values calculated by the SPU and CR methods are shown in Figure 20
and Figure 21. The CR method obtained the largest MEU, 298.3, whereas SPU
obtained only 253.4. From this result, we see that again the CR method
converged to the global maximum and SPU converged only to a local maximum.


NO.     1       2       3       4       5       6       7       8       9       10
MEU   298.3   278.3   278.3   278.3   258.3   258.3   253.4   248.3   233.4   233.4

Table 3 The largest ten EU


Figure  20  SPU  method 


Figure  21  CR  method 


4.  Conclusion 

In the first part of this report I created three examples and used them to
validate Cassio's software. The calculated results were compared with existing
software and with the paper. We conclude that Cassio's implementation obtained
the correct result in all cases.

In the second part of the report, we conclude that CR performs better than SPU
because it converges to the global maximum solution. SPU is a good approximate
method, and for many networks it is expected to achieve the globally optimal
solution as well. But this is not always the case, as shown in the first and
third tests of this part. An SPU solution is somewhat like a Nash equilibrium,
in the sense that all of its policies are local maxima. Here, a strategy is a
local maximum if the expected utility does not increase by changing only one of
its policies. Thus, SPU is only guaranteed to converge to a local maximum
solution.
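
The notion of a local maximum strategy can be made concrete with a toy example.
The sketch below is a deliberately simplified caricature of single policy
updating: each "policy" is a single binary action rather than a function of the
parents of a decision node, and the expected utility table `EU` is made up for
illustration.

```python
from itertools import product

# Made-up expected utilities for two binary decisions.
EU = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}


def single_policy_updating(start):
    """Change one decision at a time while this strictly improves the EU."""
    current = list(start)
    improved = True
    while improved:
        improved = False
        for i in range(len(current)):
            for v in (0, 1):
                candidate = current.copy()
                candidate[i] = v
                if EU[tuple(candidate)] > EU[tuple(current)]:
                    current, improved = candidate, True
    return tuple(current)


local = single_policy_updating((0, 0))              # stuck at the start
best = max(product((0, 1), repeat=2), key=EU.get)   # exhaustive search
print(local, EU[local], best, EU[best])
```

Starting from (0, 0), no single change increases the expected utility, so the
procedure stops there even though (1, 1) is better; the outcome depends on the
starting strategy.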
Abstract

An Influence Diagram is an interesting probabilistic graphical framework
for modeling effects-based operations (EBO), as it provides a compact
representation of the domain and useful properties for decision making. It is
known that the accuracy of decisions relies on the quality of the Influence
Diagram parameters. On the one hand, learning reliable parameters of such models
often requires a relatively large amount of quantitative training data, which
may not exist, may be hard to acquire and/or may contain missing values. On the
other hand, qualitative knowledge is usually available, and incorporating such
knowledge can improve parameter estimation and the accuracy of decisions. This
report describes a framework based on convex optimization that combines many
types of qualitative relations about parameters with quantitative training data
to perform parameter estimation in an Influence Diagram for the EBO planning
problem. Experiments and examples using synthetic data indicate the benefits of
this framework.


1  Introduction 

The effects-based operations (EBO) approach to military planning pursues a
campaign objective by considering direct, indirect and cascading effects of
military, diplomatic, psychological and economic actions [6, 9]. Graphical
models are especially interesting for modeling such a domain, as we can specify
many actions and factors, uncertainties and their dependencies. We employ a
graphical model called Influence Diagram, which comprises decision, uncertainty
and utility information in a compact graphical structure for decision making
[11, 19, 23].
Influence Diagrams are described through graphs where value nodes are related
to profits and costs of actions, chance nodes represent uncertainties and
dependencies in the domain and decision nodes represent actions to be taken.
After parametrization, this model can be used to evaluate plans (expected
utility) and strategy decisions (choice of which action to take). Parameter
learning is the problem of estimating probability measures of conditional
probability distributions related to the chance nodes, given the graph
structure.

Many parameter learning techniques depend heavily on training data. Ideally,
with sufficient data, it is possible to learn parameters by standard statistical
analysis such as maximum likelihood estimation. In many real-world cases,
however, data are either incomplete or sparse, which can cause inaccurate
parameter estimation. Incompleteness means that some values are missing in the
data, while sparseness means that the amount of training data is small. EBO
military planning is a domain with such characteristics, as world history
contains relatively few military campaigns. Furthermore, data about these
conflicts may be sensitive or secret and may not be available.

Even with incomplete and sparse data, general qualitative knowledge about the
domain is usually available, and such knowledge might be employed to improve
parameter estimations. While the previous report [12] has focused mainly on
model evaluation and has supposed that parameters were known in advance, we
propose here a framework based on non-linear convex optimization to solve the
probabilistic parameter learning problem by combining quantitative data and
domain knowledge in the form of qualitative constraints. Hence, this report is
complementary to [12] in the sense that parameter learning is the main focus.

Many types of qualitative constraints are discussed, including range and
relationship constraints [16], influences and synergies [25, 26], non-monotonic
constraints [24], and weak and strong qualitative constraints [4, 20]. They are
applied to an EBO-based planning problem and experiments indicate that
qualitative constraints can strongly reduce the amount of data necessary for
effective parameter learning.

Section 2 comments on some related work about parameter learning with
qualitative knowledge. Section 3 introduces our notation for Influence Diagrams
and the problem of probabilistic parameter learning. Section 4 details a
collection of qualitative constraints that can guide the learning process. Then
we describe a procedure to solve parameter learning by reformulating the problem
as a constrained convex optimization problem (Section 5). Section 6 presents
some experiments with synthetic data, and finally Section 7 concludes the report
and indicates future work.
2  Related  Work


Domain knowledge can be classified as quantitative and qualitative, which
describe the explicit quantification of parameters and approximate
characterizations, respectively. Both are useful for parameter learning, but
quantitative knowledge has been widely used while qualitative relations among
parameters have not been fully exploited. We focus our attention on related work
about qualitative relations. Parameter learning in general is a well explored
topic and we suggest Jordan's book [13] as a starting point for a broader
review.

Concerning  the  use  of  qualitative  relations,  Wittig  et  al.  [27]  present  a  method 
to  integrate  qualitative  constraints  into  two  learning  algorithms  (APN  and  EM) 
by  introducing  violation  functions  as  penalty  terms  to  the  log  likelihood  function. 
They show that domain knowledge in the form of constraints can improve learning
accuracy. However, weights for the penalty functions often need to be manually
tuned, which relies strongly on human knowledge about such weights. Altendorf et
al. [1] describe a method to incorporate monotonicity constraints into the
learning algorithm. It is based on the assumption that states of discrete random
variables can be totally ordered, as we also do. Additionally, it uses penalty
functions, which suffer from the same problem as [27]. Feelders and Van der Gaag
[10] incorporate some simple inequality constraints in the learning process. They
assume that all the variables are binary. Moreover, the constraints used in all
the above methods [1, 10, 27] are too restrictive, because constraints cannot
involve parameters from many distributions and parameters cannot appear in as
many constraints as we desire.

de  Campos  and  Cozman  [7]  formulate  the  learning  problem  as  a  constrained 
optimization  problem.  However,  they  are  restricted  to  complete  datasets  and  apply 
non-convex  optimization.  Niculescu  et  al.  [16,  17]  also  solve  the  learning  problem 
by  optimization  techniques.  They  derive  closed  form  solutions  for  the  maximum 
likelihood  estimation  supposing  some  predefined  types  of  constraints,  much  like 
the work presented here. However, there are two main limitations of their
methods: there is in general no overlap between parameters of different
constraints, that is, each parameter is restricted to appear in only one
constraint, and constraints are restricted to single distributions. There are
only very restricted cases where parameters and constraints can involve distinct
distributions. We describe a general learning procedure where such limitations
do not exist.

Similar  ideas  are  proposed  by  Xue  and  Ji  [29].  They  derive  closed  form  so¬ 
lutions  for  the  maximum  likelihood  estimation  and  the  EM  algorithm  supposing 
some  predefined  types  of  constraints.  We  emphasize  that  the  work  presented  here 
is  more  general,  because  we  allow  potentially  any  convex  constraint  among  pa¬ 
rameters  to  be  included  in  the  model  and  we  guarantee  convergence  to  a  global 
optimum of the problem. Such properties were not obtained before.

Figure 1: Simple Influence Diagram example.

3  Problem  definition 

An  Influence  Diagram  is  a  directed  acyclic  graph  where  nodes  may  have  three  dis¬ 
tinct  types:  chance,  decision  and  value  nodes.  Links  between  nodes  characterize 
conditional  dependencies  among  them.  Explicitly,  links  toward  a  chance  node  in¬ 
dicate  probabilistic  dependency  of  the  node  on  its  parents;  links  to  a  decision  node 
indicate  what  information  is  available  to  take  the  decision,  and  links  to  value  nodes 
represent  a  utility  for  the  parents  (value  nodes  cannot  have  children).  Associated 
to  each  node,  there  are  parameters: 

1. Chance nodes: for each such node, there is an associated discrete random
variable $X_i$ with domain $\Omega_{X_i} = \{x_i^1, \ldots, x_i^{r_i}\}$ and
conditional probability distributions $p(X_i \mid \pi(X_i))$, where $\pi(X_i)$
are the parents of $X_i$ in the graph. In fact there must exist one distribution
for each $\pi_j(X_i)$, where $j$ is viewed as a shortcut for a complete
configuration of the parents of $X_i$, that is, $\pi_j(X_i)$ defines a set
$(x_{i_1}^{k_1}, \ldots, x_{i_t}^{k_t})$, where $t = |\pi(X_i)|$,
$X_{i_a} \in \pi(X_i)$ and $k_a \in \{1, \ldots, r_{i_a}\}$ for $a = 1, \ldots, t$.
If a variable has no parents, then we define $j$ to be zero. Whenever necessary
and for ease of exposition, we use an extended notation
$j = (x_{i_1}^{k_1}, \ldots, x_{i_t}^{k_t})$. Therefore, $x_i^k$ represents the
$k$th state of random variable $X_i$, and $\pi_j(X_i)$ is its $j$th parent
configuration. Furthermore, let $X = \{X_1, \ldots, X_n\}$ be the set of chance
nodes/random variables (we use the two terms interchangeably if no confusion
arises) and $q_i$ the number of distinct configurations of $\pi(X_i)$ (that is,
$q_i = \prod_{X_t \in \pi(X_i)} r_t$).

2. Decision nodes: for each such node, there is a decision variable $Y_i$ with
domain $\Omega_{Y_i} = \{y_i^1, \ldots, y_i^{r_i}\}$. Besides that, the
configuration of $\pi(Y_i)$ is available at the time of the decision. A policy
for $Y_i$, denoted $\delta_i$, specifies the action to take, that is,
$\delta_i \in \Omega_{Y_i}$. We denote by $\mathcal{Y}$ the set of decision
nodes, and by $\delta$ a set of policies, one for each decision node $Y_i$ (this
is also called a strategy).

3. Value nodes: for each such node $U_i$, there is a function
$g_{U_i} : \Omega_{\pi(U_i)} \to \mathbb{Q}$, where $\Omega_{\pi(U_i)}$ is the
set of all possible configurations of $\pi(U_i)$ and $\mathbb{Q}$ are the
rational numbers. $g_{U_i}$ defines the utility for each configuration of the
parents. We define $\mathcal{U}$ as the set of all value nodes.

A  simple  example  is  depicted  in  Figure  1 .  Decision  nodes  are  represented  by 
rectangles,  chance  nodes  by  ellipses  and  value  nodes  by  diamonds. 

Example 1 Using the Influence Diagram of Figure 1, we can model Action 1 as a
ground attack to enemy positions, Action 2 as bombing such positions, Value 1 as
the cost of a ground attack, and Value 2 as the cost of bombing. $X_2$ is our
goal, for instance to achieve territory occupation, while Value 3 is the profit
of the goal. $X_1$ and $X_3$ represent the uncertain outcome of Action 1 and
Action 2, respectively.

Given a strategy $\delta$ and a complete configuration for the random variables
$x = (x_1^{k_1}, \ldots, x_n^{k_n})$, we can compute the joint probability as
follows

$$p_\delta(x) = \prod_{i=1}^{n} p(x_i^{k_i} \mid \pi_j(X_i)) \, I(x_i^{k_i} \cup \pi_j(X_i) \subseteq x \cup \delta),$$

where $I$ is an indicator function determining that $x_i^{k_i}$ and $\pi_j(X_i)$
must agree with the configuration of $x$ and $\delta$. If we see $\delta$ as
evidence, then this formula is just similar to the joint probability of
well-known Bayesian networks. Marginal probabilities can be obtained by summing
over the undesired variables, that is, if $x' \subseteq x$, then

$$p_\delta(x') = \sum_{x'' \in \Omega_{X \setminus X'}} p_\delta(x', x'').$$

For  a  given  strategy,  the  main  inference  in  an  Influence  Diagram  is  to  evaluate 
its  expected  utility,  which  is  obtained  using  a  weighted  sum  of  corresponding  utility 
functions:

$$E_\delta = \sum_{U \in \mathcal{U}} \sum_j p_\delta(\pi_j(U)) \, g_U(\pi_j(U)),$$

where $j$ ranges over all possible parent configurations of $U$.
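
The expected utility sum can be illustrated on a very small diagram. The sketch below assumes one decision node, one chance node whose only parent is the decision, and two value nodes (one depending on the chance node, one on the decision). All names and numbers are hypothetical and chosen only for illustration; they are not taken from the report.

```python
# Hypothetical toy diagram: decision Y -> chance X -> value "goal",
# and decision Y -> value "cost". Numbers are illustrative assumptions.
p_x_given_y = {
    "attack": {"success": 0.9, "failure": 0.1},
    "wait": {"success": 0.0, "failure": 1.0},
}
g_goal = {"success": 1000.0, "failure": -500.0}  # utility indexed by the state of X
g_cost = {"attack": -150.0, "wait": 0.0}         # utility indexed by the decision


def expected_utility(policy):
    """Sum over value nodes of p_delta(parent config) * g_U(parent config)."""
    p_x = p_x_given_y[policy]  # the policy is treated as evidence
    goal_term = sum(p_x[x] * g_goal[x] for x in p_x)
    cost_term = 1.0 * g_cost[policy]  # the chosen action has probability one
    return goal_term + cost_term


best_policy = max(p_x_given_y, key=expected_utility)
```

Here `expected_utility` is evaluated once per candidate policy, and `best_policy` simply picks the policy with the largest value; larger diagrams would need a proper inference routine for the marginals.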

Note  that  inferences  in  Influence  Diagrams  rely  on  the  quality  of  the  proba¬ 
bilistic  parameters  to  weight  the  utility  functions.  This  report  focuses  on  learning 
of such parameters. Hence, the graph structure and the utility functions are
supposed to be known in advance. For ease of exposition, we define $\theta$ as
the entire vector of probabilistic parameters, where
$\theta_{ijk} = p(x_i^k \mid \pi_j(X_i))$. Moreover, we need to specify an order
for the states of each random variable, that is, for each $X_i$, we define an
order $x_i^{k_1} < x_i^{k_2} < \ldots < x_i^{k_{r_i}}$ where
$\{k_1, \ldots, k_{r_i}\} \subseteq \{1, \ldots, r_i\}$. Without loss of
generality, we suppose that $k_s = s$ for all $s \in \{1, \ldots, r_i\}$ (if
necessary, we could exchange the position of some states to comply with this
rule). Thus we have $x_i^1 < x_i^2 < \ldots < x_i^{r_i}$ for all $X_i$.

3.1  Learning  probabilistic  parameters 

Given a dataset $D = \{D_1, \ldots, D_N\}$, with
$D_t = (x_1^{k_1^t}, \ldots, x_n^{k_n^t}, \delta)$ a sample of the chance and
decision nodes, the goal of parameter learning is to find the most probable
values for $\theta$. These values best explain the dataset $D$, which can be
quantified by the log likelihood function $\log(p(D \mid \theta))$, denoted
$L_D(\theta)$. Assuming that samples are drawn independently from the underlying
distribution, we have

$$L_D(\theta) = \log \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{n_{ijk}}, \tag{1}$$

where $n_{ijk}$ indicates how many elements of $D$ contain both $x_i^k$ and
$\pi_j(X_i)$.

If the dataset $D$ is complete, the Maximum Likelihood (ML) estimation method
can be described as a constrained optimization problem, i.e. maximize Equation
(1), subject to simplex equality constraints:

$$\max_\theta L_D(\theta) \quad \text{s.t.} \quad \forall i = 1, \ldots, n,\ \forall j = 1, \ldots, q_i: \ g_{ij}(\theta) = \sum_{k=1}^{r_i} \theta_{ijk} - 1 = 0, \tag{2}$$

where $g_{ij}(\theta) = 0$ imposes that the distribution defined for each
variable given a parent configuration sums to one over all variable states. This
problem has its global optimum solution at $\theta_{ijk} = n_{ijk} / n_{ij}$,
where $n_{ij} = \sum_{k=1}^{r_i} n_{ijk}$.
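
The closed-form solution amounts to counting and normalizing. The following is a minimal sketch for a single node, assuming each sample is a dictionary mapping (hypothetical) variable names to 1-based state indices; the function name `ml_estimate` and the data layout are illustrative assumptions.

```python
from collections import Counter


def ml_estimate(samples, child, parents, n_states):
    """Return {(j, k): n_ijk / n_ij} for one node, with j a parent configuration."""
    n_ijk = Counter()
    n_ij = Counter()
    for sample in samples:
        j = tuple(sample[p] for p in parents)  # observed parent configuration
        k = sample[child]                      # observed child state
        n_ijk[(j, k)] += 1
        n_ij[j] += 1
    # Only parent configurations that occur in the data can be estimated.
    return {(j, k): n_ijk[(j, k)] / n_ij[j]
            for j in n_ij for k in range(1, n_states + 1)}


samples = [
    {"X1": 1, "X2": 1},
    {"X1": 1, "X2": 2},
    {"X1": 1, "X2": 1},
    {"X1": 2, "X2": 2},
]
theta = ml_estimate(samples, child="X2", parents=["X1"], n_states=2)
```

Parent configurations that never occur in the data are simply absent from the result, which is exactly the sparse-data situation that motivates the qualitative constraints discussed next.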

4  Qualitative  Constraints 

Standard likelihood estimations are usually sufficient if we have enough data,
regardless of whether one uses prior knowledge or not. However, when only a
small amount of data is available, the likelihood function may not produce
reliable estimations for the parameters.

Example 2 Suppose we are working with the chance nodes $X_1$, $X_2$, $X_3$ of
Example 1, and they are associated to binary random variables (with categories
$x_i^1$ meaning true and $x_i^2$ meaning false). Suppose further that we have
the dataset $D = (D_1, D_2)$, with $D_1 = (x_1^1, x_2^1, x_3^1)$ and
$D_2 = (x_1^2, x_2^2, x_3^2)$. Using the ML estimation, we have the posterior
probabilities $\theta_{101} = \theta_{102} = \theta_{301} = \theta_{302} = 0.5$
and $\theta_{2 j_1 1} = \theta_{2 j_2 2} = 1$, with $j_1 = (x_1^1, x_3^1)$,
$j_2 = (x_1^2, x_3^2)$, $j_3 = (x_1^2, x_3^1)$, $j_4 = (x_1^1, x_3^2)$.
Posterior probability distributions $\theta_{2 j_3 k}$ and $\theta_{2 j_4 k}$
cannot be estimated as no data about such configurations are available.

Situations  like  in  Example  2  could  be  alleviated  by  inserting  quantitative  prior 
distributions  for  the  parameters.  However,  acquiring  such  quantitative  prior  infor¬ 
mation  may  not  be  an  easy  task.  An  incorrect  quantitative  prior  might  lead  to  bad 
estimation  results.  For  example,  standard  methods  apply  quantitative  uniform  pri¬ 
ors.  In  this  case,  if  no  data  are  present  for  a  given  parameter,  then  the  answer  would 
be  0.5,  which  may  be  far  from  the  correct  value. 

A  path  to  overcome  this  situation  is  through  qualitative  information.  Qualita¬ 
tive  knowledge  is  likely  to  be  available  even  when  quantitative  knowledge  is  not, 
and tends to be more reliable. For example, hardly anyone will make a mistake
about the qualitative relation between the sizes of the Earth and the Sun, while
almost everyone will fail to specify a quantitative ratio (even an approximate
one).

Example 3 Suppose, in addition to Example 2, that the following two constraints
are known: $\theta_{302} + \theta_{2 j_3 1} \le 0.7$ and
$\theta_{2 j_1 1} \le \theta_{2 j_4 2}$. With this knowledge, we can state that
$\theta_{2 j_3 1} \le 0.2$ and $\theta_{2 j_4 2} = 1$, reducing the space of
possible parameters and alleviating the problem with sparse quantitative data.
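
The two conclusions of Example 3 follow by substituting the ML estimates of Example 2 into the constraints. A short worked version, using only the values stated in the examples and the fact that every parameter is a probability (hence at most one), could read:

```latex
\begin{align*}
\theta_{302} + \theta_{2 j_3 1} \le 0.7 \ \text{and}\ \theta_{302} = 0.5
  &\implies \theta_{2 j_3 1} \le 0.7 - 0.5 = 0.2, \\
\theta_{2 j_1 1} \le \theta_{2 j_4 2},\ \theta_{2 j_1 1} = 1 \ \text{and}\ \theta_{2 j_4 2} \le 1
  &\implies \theta_{2 j_4 2} = 1.
\end{align*}
```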

We  start  by  describing  constraints  that  appear  in  Qualitative  Probabilistic  Net¬ 
works  [25].  The  most  common  type  of  qualitative  constraints  is  called  influence. 
Qualitative  influences  define  some  knowledge  about  the  state  of  a  variable  given 
the  state  of  another.  For  example,  the  probability  of  achieving  territory  occupation 
given  that  a  successful  ground  attack  was  employed  is  greater  than  when  it  was  not. 


Definition 1 Let $X_a$, $X_b$ be variables such that $X_a \in \pi(X_b)$. We say
that $X_a$ influences $X_b$ positively if

$$\forall k_a > k'_a,\ \forall k_b = 1, \ldots, r_b - 1: \quad \sum_{k > k_b} \theta_{b j_{k_a} k} \ge \sum_{k > k_b} \theta_{b j_{k'_a} k} + \delta, \tag{3}$$

where $j_{k_a} = (x_a^{k_a}, \pi_{j^*}(X_b))$ and $j^*$ is an index ranging over
all parent configurations except for $X_a$, that is,
$\pi_{j^*}(X_b) = (x_i^{k_i} : X_i \in \pi(X_b) \wedge i \ne a \wedge k_i \in \{1, \ldots, r_i\})$,
and $\delta \ge 0$ is a constant.

This roughly means that observing a greater state for $X_a$ makes greater states
of $X_b$ more likely. The definition makes use of cumulative probability values,
so it works even for non-binary variables. When applied to binary variables, the
summations of Equation (3) disappear and we have a more natural formulation:
$\theta_{b j_2 2} \ge \theta_{b j_1 2} + \delta$, with $j_k$ as in Definition 1.
In this case, the greater state is 2, and observing $x_a^2$ makes $x_b^2$ more
likely.

If the constraints of Definition 1 hold for $\delta > 0$, the influence is said
to be strong with threshold $\delta$ [20]. Otherwise, it is said to be weak. A
negative influence is obtained by replacing the inequality operator $\ge$ by
$\le$ and changing the sign of the $\delta$ term to negative in Equation (3). A
zero influence is obtained by changing the inequality to an equality.
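
As an illustration of the cumulative form of Equation (3), the sketch below checks a positive influence for the simplified case where the child has a single parent, so no other parent configurations need to be enumerated. The table layout (a list of child distributions, one per parent state, in increasing state order) and the function names are assumptions made for this example.

```python
from itertools import combinations


def tail(dist, kb):
    """Sum of the probabilities of child states k > kb (states are 1-based)."""
    return sum(dist[kb:])


def has_positive_influence(cpt, delta=0.0, tol=1e-12):
    """cpt[ka - 1] is the distribution of the child given parent state ka."""
    r_b = len(cpt[0])
    # combinations yields index pairs (lower, higher), i.e. k'_a < k_a.
    for lower, higher in combinations(range(len(cpt)), 2):
        for kb in range(1, r_b):
            if tail(cpt[higher], kb) < tail(cpt[lower], kb) + delta - tol:
                return False
    return True


# Hypothetical table: binary parent, ternary child.
cpt = [[0.7, 0.2, 0.1],   # child distribution given parent state 1
       [0.3, 0.3, 0.4]]   # child distribution given parent state 2
weak = has_positive_influence(cpt)
strong = has_positive_influence(cpt, delta=0.2)
too_strong = has_positive_influence(cpt, delta=0.35)
```

A negative influence check would flip the inequality and the sign of `delta`, mirroring the text above.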

We  now  define  a  conjugate  influence  from  two  parents  to  a  common  child, 
called  additive  synergy  [25].  Synergies  are  influences  from  two  parents  acting 
to  influence  the  child.  For  example,  suppose  we  bomb  an  area  and/or  perform  a 
ground  attack.  Both  actions  contribute  to  the  goal.  But  the  probability  of  success 
given that we do both or neither is lower than the probability of success given
that we do only one of them (bombing an area under ground attack could kill our
own infantry). So, the actions together have a negative influence on achieving
the goal.

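Definition 2 Let $X_a$ and $X_c$ be parents of $X_b$. We say that $X_a$ and
$X_c$ impose a negative additive synergy on $X_b$ if

$$\forall k_b = 1, \ldots, r_b - 1: \quad \sum_{k > k_b} \sum_{k_a = k_c} \theta_{b j_{k_a,k_c} k} \le \sum_{k > k_b} \sum_{k_a \ne k_c} \theta_{b j_{k_a,k_c} k} - \delta, \tag{4}$$

where $j_{k_a,k_c} = (x_a^{k_a}, x_c^{k_c}, \pi_{j^*}(X_b))$ and $j^*$ ranges
over all parent configurations not including $X_a$ nor $X_c$, and
$\delta \ge 0$ is a constant.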

This means that observing the same configuration for the parents $X_a$ and $X_c$
makes a greater state of $X_b$ less likely. Again the case over binary variables
is simpler:
$\theta_{b j_{1,1} 2} + \theta_{b j_{2,2} 2} \le \theta_{b j_{1,2} 2} + \theta_{b j_{2,1} 2} - \delta$,
where $j_{k_a,k_c}$ is as in Definition 2, which enforces the sum of parameters
with equal configurations for $X_a$ and $X_c$ to be lower than the sum of
parameters with distinct configurations. If these constraints hold for
$\delta > 0$, this synergy is said to be strong with threshold $\delta$ [20].
Otherwise it is said to be weak. When omitted, $\delta$ is assumed to be zero.
Positive and zero additive synergies are obtained analogously.
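
For binary variables the synergy condition is a single inequality over four parameters. A minimal sketch, assuming the child has exactly the two parents involved and that `theta[(ka, kc)]` (a hypothetical layout) stores the probability of the greater child state given the parent states:

```python
def negative_additive_synergy(theta, delta=0.0):
    """theta[(ka, kc)] = P(X_b = 2 | X_a = ka, X_c = kc), all variables binary."""
    same = theta[(1, 1)] + theta[(2, 2)]       # equal parent configurations
    distinct = theta[(1, 2)] + theta[(2, 1)]   # distinct parent configurations
    return same <= distinct - delta


# Hypothetical parameters: doing exactly one action works best.
theta = {(1, 1): 0.1, (1, 2): 0.7, (2, 1): 0.6, (2, 2): 0.4}
holds_weakly = negative_additive_synergy(theta)
holds_with_half = negative_additive_synergy(theta, delta=0.5)
holds_with_one = negative_additive_synergy(theta, delta=1.0)
```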

Non-monotonic  influences  and  synergies  happen  when  constraints  hold  only 
for some configurations of inactive parents (regarding that constraint) [21].
For example, suppose three binary variables such that $X_1$ has $X_2$ and $X_3$
as parents and that $\theta_{1(x_2^2, x_3^1)2} \ge \theta_{1(x_2^1, x_3^1)2}$
holds, but $\theta_{1(x_2^2, x_3^2)2} \ge \theta_{1(x_2^1, x_3^2)2}$ cannot be
stated. Hence we do not have a positive influence of $X_2$ on $X_1$, because
both constraints would have to be valid to ensure that influence. In fact we
might realize that the state of $X_3$ (the inactive parent concerning this
influence) is important in the constraint. Then, we may state a non-monotonic
influence of $X_2$ on $X_1$ that holds when $X_3$ is $x_3^1$ but not when it is
$x_3^2$. Situational signs [4] and
context-specific  signs  [2  ]  are  special  cases  of  such  non-monotonic  constraints.  To 
include  all  such  situations,  we  define  a  very  general  constraint:  a  linear  relation¬ 
ship  constraint  defines  a  linear  relative  relationship  between  sets  of  parameters  and 
numerical  bounds. 


Definition 3 Let $\theta_A$ be a sequence of parameters, $\alpha_A$ a
corresponding sequence of constant numbers and $\alpha$ also a constant. A
linear relationship constraint is defined as

$$\sum_{\theta_{ijk} \in \theta_A} \alpha_{ijk} \cdot \theta_{ijk} \le \alpha, \tag{5}$$

that is, any linear constraint over parameters can be expressed as a linear
relationship constraint. It is worth mentioning some special cases: if
$\theta_A$ has only one parameter $\theta_{ijk}$ and $\alpha_{ijk} = 1$, the
constraint becomes an upper bound constraint for $\theta_{ijk}$ (we can obtain a
lower bound using negative $\alpha_{ijk}$ and $\alpha$). These bounds are also
known as range constraints [16]. If all parameters involved in a linear
relationship constraint share the same node index $i$ and parent configuration
$j$, the constraint is called an intra-relationship constraint. Otherwise, it is
an inter-relationship constraint. We indicate this difference because usually
inter-relationship constraints
lead  to  hard  learning  procedures  [16].  Here,  as  we  will  discuss,  there  is  no  im¬ 
portant  distinction  between  them  regarding  complexity.  We  note  that  linear  rela¬ 
tionship  constraints  generalize  influences  and  additive  synergies  and  correspond¬ 
ing  non-monotonic  relations,  as  all  of  them  are  linear  constraints  over  parameters. 
However  we  decided  to  keep  such  definitions  separately  given  their  importance  in 
the  literature. 
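
A linear relationship constraint can be stored as a map from parameter indices to coefficients plus a bound. The sketch below is one possible encoding, assuming parameters are indexed by hypothetical tuples `(i, j, k)`; it shows an upper bound, a lower bound written with negated coefficient and bound, and an inter-relationship constraint in the spirit of Example 3.

```python
def satisfies(constraint, theta, tol=1e-12):
    """Check sum(alpha_ijk * theta_ijk) <= alpha for one constraint."""
    coefficients, bound = constraint
    total = sum(a * theta[index] for index, a in coefficients.items())
    return total <= bound + tol


# Hypothetical parameter values, indexed by (node, parent configuration, state).
theta = {(3, 0, 2): 0.5, (2, 3, 1): 0.1}

upper_bound = ({(3, 0, 2): 1.0}, 0.7)              # theta_302 <= 0.7
lower_bound = ({(3, 0, 2): -1.0}, -0.2)            # theta_302 >= 0.2
inter = ({(3, 0, 2): 1.0, (2, 3, 1): 1.0}, 0.7)    # theta_302 + theta_2j31 <= 0.7

all_hold = all(satisfies(c, theta) for c in (upper_bound, lower_bound, inter))
violated = satisfies(inter, {(3, 0, 2): 0.5, (2, 3, 1): 0.3})
```

Intra- and inter-relationship constraints share this representation; the only difference is whether the indices in `coefficients` agree on `i` and `j`.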


5  Learning  through  convex  optimization 

The definitions of the previous section show many types of qualitative
constraints that can be used to describe our knowledge. In this section we
present an optimization approach to solve the learning problem using a great
variety of constraints. The main achievements are:

• It is possible to mix different types of constraints in a straightforward way.
Parameters of the network can appear in as many constraints as desired.

• There is no distinction between creating a qualitative constraint within a
single probability distribution and among parameters in many distributions.

• Many types of qualitative relations can be handled (any convex constraint),
including the usual relations defined in the literature [4, 16, 17, 18, 22, 25].

We describe how the constraints and the log likelihood function can be
formulated as a convex optimization program. Such a program can be specified as
[3]:

$$\min_\theta f(\theta) \quad \text{s.t.} \quad \forall i = 1, \ldots, m: \ g_i(\theta) \le 0, \tag{6}$$

where $\theta$ is the vector of optimization variables, $m$ is the number of
constraints and $f$ and $g_i$ (for all $i$) are continuous convex functions over
the parameter space.

First note that our objective function is concave because log is concave and a
positive linear combination of concave functions is also concave (each $n_{ijk}$
is positive and known from the dataset; if $n_{ijk} = 0$, the term is simply
discarded). Thus

$$\arg\max_\theta \sum_{i,j,k} n_{ijk} \log \theta_{ijk} = \arg\min_\theta \Big( -\sum_{i,j,k} n_{ijk} \log \theta_{ijk} \Big),$$

where the right-hand side is a minimization of a convex function ($-f$ is convex
when $f$ is concave), as required in Equation (6). Regarding constraints, any
linear function is convex. Simple manipulations can bring them to the form of
Equation (6) (a constraint $h(\theta) \ge 0$ just needs to be multiplied by
$-1$, while an equality can be viewed as two inequalities). All constraints
defined in Section 4 are linear, hence convex, and can be directly inserted into
the convex program.

To  solve  convex  programming,  there  are  many  optimization  algorithms.  We 
can  use  specialized  interior  point  solvers  [2]  or  even  some  general  optimization 
ideas  [15],  because  convex  programming  has  the  attractive  property  that  any  local 
optimum  is  also  a  global  optimum.  Furthermore,  such  global  optimum  can  be 
found  in  polynomial  time  in  the  size  of  input  [3]. 
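
To make the convex program concrete, the sketch below solves a deliberately tiny instance without an external solver: a binary child with one binary parent and a single positive-influence constraint $\theta_2 \ge \theta_1 + \delta$, where $\theta_1$ and $\theta_2$ denote the probability of the greater child state given parent state 1 and 2. It relies on a standard property of concave maximization: if the unconstrained optimum violates the constraint, the constrained optimum lies on the constraint boundary, which leaves a one-dimensional concave problem that ternary search can solve. The function names, the assumption that all counts are positive, and the use of ternary search in place of an interior point solver are choices made for this illustration only.

```python
import math


def loglik(t1, t2, n):
    """Log likelihood; n[(ka, kb)] counts samples with parent ka and child kb."""
    return (n[(1, 1)] * math.log(1 - t1) + n[(1, 2)] * math.log(t1)
            + n[(2, 1)] * math.log(1 - t2) + n[(2, 2)] * math.log(t2))


def constrained_ml(n, delta=0.0, iters=200):
    """Maximize loglik subject to t2 >= t1 + delta (all counts assumed > 0)."""
    # Unconstrained optimum: relative frequencies.
    t1 = n[(1, 2)] / (n[(1, 1)] + n[(1, 2)])
    t2 = n[(2, 2)] / (n[(2, 1)] + n[(2, 2)])
    if t2 >= t1 + delta:
        return t1, t2
    # Constraint is active: substitute t2 = t1 + delta and search over t1.
    lo, hi = 1e-9, 1 - delta - 1e-9
    for _ in range(iters):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if loglik(a, a + delta, n) < loglik(b, b + delta, n):
            lo = a
        else:
            hi = b
    t1 = (lo + hi) / 2
    return t1, t1 + delta


# Hypothetical counts whose frequencies (0.8, 0.3) violate the constraint.
counts = {(1, 1): 2, (1, 2): 8, (2, 1): 7, (2, 2): 3}
t1, t2 = constrained_ml(counts)
```

With $\delta = 0$ and these counts the two parameters are tied, so the result coincides with the pooled frequency 11/20. Problems with many overlapping constraints would require a general convex solver rather than this shortcut.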

5.1  Incomplete  data 

Incomplete data means that some fields of the dataset are unknown. If the
dataset is $D = \{D_1, \ldots, D_N\}$, then each
$D_t \subseteq \{x_1^{k_1^t}, \ldots, x_n^{k_n^t}\}$ is a sample of some nodes.
We say that $u_t$ is the missing part in tuple $t$, that is,
$u_t \cap D_t = \emptyset$ and $u_t \cup D_t$ is a complete instantiation for
the nodes. Let $U$ be the set of all missing data. In this case, the likelihood
function $\log(p(D \mid \theta))$ is not a simple product anymore, and the
corresponding optimization program is not convex.

A common method to overcome this situation is the standard EM algorithm [8],
which starts from some initial guess, and then iteratively takes two types of
steps (E-steps and M-steps) to reach a local maximum of the likelihood function.
Particularly for discrete nodes, the E-step computes the expected counts for all
parameters, and the M-step estimates the parameters by maximizing the log
likelihood function given the counts from the E-step, just as would be done with
a complete dataset. The EM algorithm converges to a local maximum under very few
assumptions [28].

Assume $\theta^0$ is an initial guess for the parameters, and $\theta^t$ denotes
the estimation after $t$ iterations, $t = 1, 2, \ldots$. Then, each iteration of
EM can be summarized as follows:

• E-step: compute the expectation of the log likelihood given the observed data
$D$ and the current estimation of parameters $\theta^t$:
$Q(\theta \mid \theta^t) = E_{\theta^t}[\log p(U \cup D \mid \theta) \mid \theta^t, D]$.

• M-step: find the new parameter $\theta^{t+1}$, which maximizes the expected
log likelihood computed in the E-step:
$\theta^{t+1} = \arg\max_\theta Q(\theta \mid \theta^t)$.

We propose to extend EM with the convex optimization program of Section 5, that
is, the M-step is performed using convex programming. The value of
$\theta^{t+1}$ is $\arg\max_\theta Q(\theta \mid \theta^t)$ subject to the
qualitative constraints of the problem, and a polynomial time algorithm can be
employed to solve this convex program (as described in Section 5). In this
context, qualitative constraints may help EM to avoid poor local maxima and
improve the overall solution. Furthermore, because the parameter space is convex
and the enhanced M-step produces a global optimum solution for the current
parameter counts, this modified EM shares the convergence and optimality
properties of the standard EM algorithm [5, 28]. The modified EM is more
time-consuming than the standard EM, as each M-step requires the solution of a
convex optimization program (standard EM may use the closed form solution for
ML). However, just as in standard EM an improving solution is enough instead of
an optimum one (a variant called Generalized EM), we might stop the convex
programming as soon as an improving solution is found.
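
The structure of the modified EM can be sketched on a two-node network $X_1 \to X_2$ with binary variables, where $X_1$ is missing in some samples. To keep the example self-contained, the constraints are restricted to range constraints on the distribution of $X_2$; for a binary distribution with a range constraint, the constrained maximizer of the expected log likelihood is the expected relative frequency clipped to the allowed interval, so no general solver is needed. The data layout, function names, initial guess and bounds are all assumptions made for this illustration.

```python
def e_step(data, a, b):
    """Expected counts; a = P(X1 = 2), b[k1] = P(X2 = 2 | X1 = k1)."""
    n1 = {1: 0.0, 2: 0.0}
    n12 = {(k1, k2): 0.0 for k1 in (1, 2) for k2 in (1, 2)}
    for x1, x2 in data:
        if x1 is None:  # X1 missing: use its posterior given X2
            w = {}
            for k1 in (1, 2):
                prior = a if k1 == 2 else 1 - a
                lik = b[k1] if x2 == 2 else 1 - b[k1]
                w[k1] = prior * lik
            z = w[1] + w[2]
            w = {k1: w[k1] / z for k1 in w}
        else:
            w = {k1: 1.0 if k1 == x1 else 0.0 for k1 in (1, 2)}
        for k1 in (1, 2):
            n1[k1] += w[k1]
            n12[(k1, x2)] += w[k1]
    return n1, n12


def m_step(n1, n12, bounds):
    """Constrained M-step: clipped frequencies solve the 1-D concave problems."""
    a = n1[2] / (n1[1] + n1[2])
    b = {}
    for k1 in (1, 2):
        freq = n12[(k1, 2)] / (n12[(k1, 1)] + n12[(k1, 2)])
        lower, upper = bounds[k1]
        b[k1] = min(max(freq, lower), upper)
    return a, b


def constrained_em(data, bounds, iters=50):
    a, b = 0.5, {1: 0.4, 2: 0.6}  # arbitrary initial guess
    for _ in range(iters):
        n1, n12 = e_step(data, a, b)
        a, b = m_step(n1, n12, bounds)
    return a, b


# Hypothetical samples (x1, x2); None marks a missing value of X1.
data = [(1, 1), (1, 1), (2, 2), (2, 2), (None, 2), (None, 1), (1, 2)]
bounds = {1: (0.01, 0.3), 2: (0.6, 0.99)}
a, b = constrained_em(data, bounds)
```

For constraints that couple several distributions, `m_step` would be replaced by a call to a convex solver over the same expected counts; the E-step is unchanged.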

6  Experiments 

In order to test the performance of our method against standard ML estimation
and the standard EM algorithm given sparse and incomplete data, we use an
Influence Diagram similar to the one described by Zhang and Ji [30] for a
hypothetical EBO-based military planning problem. It is shown in Figure 2. The
goal is to win a war and it is represented by the Hypothesis node (on top of
Figure 2). Just below there are the subgoals Air_superiority and
Territory_occupation, which are directly related to the main goal. There are
eight decision nodes (represented by rectangles, while chance nodes are
ellipses): destroy_C2 (C2 stands for Command and Control), destroy_Radars,
destroy_Communications, launch_Air_strike, destroy_RD, destroy_storage,
destroy_assembly, launch_ground_attack. Just above the decision nodes, we have
chance nodes representing the outcomes of performing such actions (they indicate
the workability of such systems), and below we have value nodes (diamond-shaped
nodes) describing the cost of each action. Furthermore, we have four chance
nodes (in the center of the figure) indicating the general workability of the
IADS (Integrated Air Defense System), Airforce, Artillery and Ground_force of
the enemy. The overall profit of winning is given by the value node $U_H$, child
of Hypothesis.

Figure 2: Influence Diagram for a hypothetical EBO-based planning problem.

As this is a hypothetical example, we define utility functions and probability
distributions as follows:

• The probability of Hypothesis is one given that all subgoals are achieved. If
one of the subgoals is not achieved, then the probability of Hypothesis is 60%,
and if none of the subgoals is achieved, then we certainly fail in the campaign.

• For the subgoals Air_superiority and Territory_occupation, we define them as
accomplished with probability one when both children are achieved, 50% when only
one child is achieved, and zero when none is achieved.

• For the probabilities of IADS, Airforce, Artillery and Ground_force, we define
a decrease of 30% for each unaccomplished child (with a minimum of zero, of
course). For instance, IADS has probability 40% if two of its four children are
achieved, and Artillery has probability zero when at least four of its children
are unaccomplished.

• For the probabilities of the outcomes of actions (chance nodes just above
decision nodes), we define a success rate of 90%. For example, the decision
destroy_Radars will have EW/GCI_radars destroyed with probability 90% (EW/GCI
means Early Warning/Ground Control Interception).

• The reward of achieving the main goal is 1000, while not achieving it costs
500.

• Costs of actions are as follows: ground_attack is 150, air_strike is 50, and
other actions cost 20 each.
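
The probability rules listed above can be written as small functions of the number of achieved (or unaccomplished) children. This is only a restatement of the rules as given in the text; the function names are hypothetical.

```python
def p_hypothesis(subgoals_achieved):
    """Probability of Hypothesis given how many of its two subgoals are achieved."""
    return {2: 1.0, 1: 0.6, 0: 0.0}[subgoals_achieved]


def p_subgoal(children_achieved):
    """Probability of a subgoal given how many of its two children are achieved."""
    return {2: 1.0, 1: 0.5, 0: 0.0}[children_achieved]


def p_workability(children_unaccomplished):
    """30% decrease per unaccomplished child, never below zero."""
    return max(0.0, 1.0 - 0.3 * children_unaccomplished)


iads_two_missing = p_workability(2)       # two of four children unaccomplished
artillery_four_missing = p_workability(4)
```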

Using this Influence Diagram as our “truth” for the parametrization, we generate
samples from it. After training the model with distinct amounts of data, we
apply the Kullback-Leibler (KL) divergence criterion [14] to measure the
difference between the joint probability distributions induced by the learned
diagrams and the distribution of the “truth” diagram. For probability
distributions $P_1$ and $P_2$ over discrete variables, where $P_1$ is considered
the “truth”, the KL divergence is

$$KL(P_1, P_2) = \sum_i P_1(i) \log \frac{P_1(i)}{P_2(i)},$$

where $i$ ranges over all possible configurations of the discrete variables and
$P_j(i)$ means the probability value for the configuration $i$. We have chosen
this criterion because the problem is hypothetical and no real data are
available. Thus, it would not be meaningful to compare maximum expected
utilities of the diagrams generated with each approach, because we could obtain
a great expected utility even using wrong parameters. Such a result would be
unreliable, as we want parameters that best describe a real situation, and those
parameters might not lead to a high expected utility. So, using the KL
divergence, we measure how much the qualitative knowledge improves parameter
learning towards the correct values.
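
A direct implementation of the KL divergence for distributions stored as dictionaries is sketched below. It assumes the natural logarithm (the text does not state the base) and the usual conventions that terms with $P_1(i) = 0$ contribute zero and that the divergence is infinite when $P_1$ puts mass where $P_2$ has none.

```python
import math


def kl_divergence(p1, p2):
    """KL(P1, P2) = sum_i P1(i) * log(P1(i) / P2(i)); P1 is the reference."""
    total = 0.0
    for config, p in p1.items():
        if p == 0.0:
            continue  # a term with zero reference probability contributes nothing
        q = p2.get(config, 0.0)
        if q == 0.0:
            return math.inf  # P1 has mass on a configuration impossible under P2
        total += p * math.log(p / q)
    return total


# Hypothetical distributions over two configurations.
truth = {"a": 0.5, "b": 0.5}
learned = {"a": 0.9, "b": 0.1}
divergence = kl_divergence(truth, learned)
```

The divergence is zero only when the two distributions agree on every configuration with positive reference probability, which is why lower values in Table 1 indicate learned parameters closer to the generating ones.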

We  conduct  experiments  for  datasets  with  10,  100  and  1000  samples,  without 
and  with  qualitative  constraints.  When  employed,  the  qualitative  constraints  state 
that  each  chance  or  decision  node  has  a  positive  influence  on  its  children,  that 
is, achieving the parent makes accomplishing the children more likely. We use
the following constraints; although they are hypothetical, we tried to make them
as meaningful as possible.
• Air_superiority has a strong positive influence on Hypothesis, as does
Territory_occupation, which means that the probability of success on Hypothesis
given that we have achieved Air_superiority is much greater than the probability
of success on Hypothesis given that we have not achieved Air_superiority, as
subgoals contribute to the main goal. Analogously, the probability of Hypothesis
given success on Territory_occupation is much greater than the probability of
Hypothesis given failure on Territory_occupation.

• IADS and Airforce have strong positive influences on Air_superiority, which
means that the probability of Air_superiority given each one of those
achievements is much greater than when we do not obtain them. Again, IADS and
Airforce are important steps to achieve Air_superiority, so they contribute
positively to it.

• Artillery and Ground_force have a positive influence on Territory_occupation,
so the probability of the latter is greater when Artillery and/or Ground_force
are accomplished.

• Destroyed EW/GCI radars have a negative influence on the IADS and Airforce of
the enemy, as do inoperable Communications and C2 and a performed Air_strike.

• A destroyed RD_facility has a negative influence on the Artillery and
Ground_force of the enemy, as do an inoperable C2, storage_facility and
assembly_facility and a performed ground_attack.

• The probability of Air_superiority given success on IADS and Airforce is
greater than the probability of Territory_occupation given Artillery and
Ground_force, because we consider (in this example) that it is easier to control
the airspace than to control the surface.

For each amount of data (10, 100 and 1000 samples), we work with 20 random sets
of data and qualitative constraints. Mean and variance are presented in the
first three rows of Table 1 for complete data. The last three rows of the table
show results for datasets with missing fields (chosen at random, in number equal
to three times the number of samples). In such cases, the EM and constrained EM
methods are employed.

Amount of data    Mean                            Variance
                  Unconstrained   Constrained     Unconstrained   Constrained
10                8.4             2.39            0.35            0.64
100               2.25            0.32            0.09            0.03
1000              0.06            0.08            0.0005          0.00004
10 (I)            8.12            2.46            0.15            0.1
100 (I)           2.33            0.36            0.15            0.02
1000 (I)          0.08            0.08            0.0004          0.0001

Table 1: KL divergence for 20 runs of the learning procedure using random
samples and constraints. Unconstrained results are the standard case (ML / EM)
while Constrained indicates the use of qualitative relations during learning
(Constrained ML / Constrained EM). Rows marked with (I) were executed with
incomplete datasets.

Table 1 indicates a decrease in the divergence when working with qualitative
constraints, which shows that such constraints were actively used during the
learning process. For sparse data, the divergence was substantially reduced, so
harder
problems  are  most  benefited.  Besides  that,  results  show  a  somewhat  expected  sit¬ 
uation:  when  enough  data  is  available,  qualitative  constraints  become  less  useful. 
We  can  even  see  a  case  where  the  average  result  with  qualitative  constraints  was 
sightly  worse  than  without  constraints  (1000  samples  without  missing  data).  That 
situation  eventually  happens  when  enough  data  is  available  and  there  exist  quali¬ 
tative  constraints  that  are  not  precise  with  respect  to  the  ground  truth.  Thus,  we 
emphasize  that  problems  like  EBO-based  planning  may  have  strong  benefits  from 
qualitative  knowledge,  as  usually  there  are  not  enough  data,  while  problems  where 
enough  data  are  available  may  not  have  the  same  benefits. 

Figures 3 and 4 show the difference in the KL divergence between the
unconstrained learning (standard case) and the constrained learning with
qualitative relations for each node of the Influence Diagram. Black bars are for
10-sample cases and white bars for 100-sample cases. Node numbers in these
graphs are defined from 1 to 23 according to the following order: (1) C2,
(2) EW/GCI, (3) Communications, (4) Airstrike, (5) IADS, (6) Air-force,
(7) Air superiority, (8) RD facility, (9) storage facility, (10) assembly
facility, (11) ground-attack, (12) Artillery, (13) Ground-force,
(14) Territory-occupation, (15) Hypothesis. We can see that most benefits of
qualitative constraints appear in nodes that are hard to learn, for instance
Hypothesis, its two children Air superiority and Territory-occupation, and
their children. The advantages with complete and incomplete datasets were very
similar to each other, which shows that even with incomplete data the
qualitative constraints help to guide the learning process.
[Figure 3 plot: KL divergence for complete data. Legend: 10 samples and
100 samples (unconstrained minus constrained).]


Figure  3:  Difference  in  KL  divergence  between  unconstrained  (standard)  learning 
and  constrained  learning  for  complete  data.  Positive  bars  mean  that  the  constrained 
version  performed  better  than  the  unconstrained  one. 

7  Conclusion 

This report describes a framework for parameter learning of Influence Diagrams
when qualitative knowledge is available, which is especially important for
training with sparse data. Even with enough data, qualitative constraints may
help to guide the learning procedures.

For complete data, we directly apply convex optimization to obtain a global
optimum of the constrained maximum likelihood estimation, while for incomplete
data, we extend the EM method by introducing a constrained maximization in the
M-step. We apply our method to synthetic Influence Diagrams based on EBO
military planning. Experimental results demonstrate that our method can fully
exploit the qualitative knowledge to improve parameter learning accuracy. With
sparse data and constraints, it is possible to obtain results similar to those
obtained with considerably more data and no constraints.

[Figure 4 plot: KL divergence for incomplete data. Legend: 10 samples
(unconstrained minus constrained).]

Figure 4: Difference in KL divergence between unconstrained (standard) learning
and constrained learning for incomplete data. Positive bars mean that the
constrained version performed better than the unconstrained one.

For future research, we intend to apply the ideas to harder planning problems
where only sparse data are available. We plan to explore other applications of
qualitative constraints, such as strategy evaluation (trying to reduce the time
for each evaluation) and planning (constraints may guide the search for the best
strategy). Furthermore, the qualitative constraints presented here can be viewed
as hard constraints, since they must be satisfied during the learning. We intend
to explore, together with hard constraints, some soft qualitative constraints in
the sense that estimates may occasionally not comply with them (to guard against
possibly "imprecise" constraints).
Exploiting Qualitative Constraints for Learning
Bayesian Networks under Insufficient Data


Zheng  Xue  and  Qiang  Ji 

Rensselaer  Polytechnic  Institute 


Abstract 


Graphical models (GMs) such as Bayesian Networks (BNs) or Influence Diagrams
(IDs) are being increasingly applied to many different applications. One
bottleneck in using GMs is that learning the model parameters often requires
a relatively large amount of training data. However, in real life and for many
applications, training data are often incomplete or sparse, which can cause low
learning accuracy. Incorporating domain knowledge can help alleviate this
problem. Instead of using quantitative prior knowledge, as most existing methods
do, this paper introduces a novel learning method based on systematically
combining the training data with some qualitative knowledge.

To validate our method, we compare it with the Maximum Likelihood (ML)
estimation method under sparse data and with the Expectation Maximization (EM)
algorithm under incomplete data, respectively. The experimental results show
that our method improves the parameter learning accuracy significantly compared
with both the ML and EM algorithms.


1  Introduction 


Among all the issues of graphical models, parameter learning is one of the main
challenges. Parameter learning estimates the entries of the conditional
probability distributions (CPDs) given the structure of a model. Many learning
techniques rely heavily on training data [7]. Ideally, with sufficient data, it
is possible to learn the parameters by standard statistical analysis such as
maximum likelihood (ML) estimation. In many real-world cases, however, the data
are either incomplete or sparse, which can cause inaccurate parameter
estimation. Data incompleteness means that data are missing for some parameters,
while data sparseness means that the amount of training data is limited.

When data are incomplete, the Expectation-Maximization (EM) algorithm [3] is
often used. Most EM-based methods work under the assumption that data are
missing at random (MAR), which means the missing values can be estimated from
the observed ones in some way. However, when data are missing completely at
random (MCAR), e.g., data of hidden nodes, the learned parameters could be far
from the ground truth. The reason is that the missing data do not even depend on
the observed ones, so there is no way to estimate the missing data from the
observed ones alone.

In this paper, we propose a framework to solve the parameter learning problem by
combining quantitative data and domain knowledge in the form of qualitative
constraints. Two kinds of qualitative constraints are defined: range
constraints, which are applied to individual parameters, and relationship
constraints, which are applied to pairs of parameters. For sparse but complete
data, we solve the learning task by reformulating the problem as a constrained
ML (CML) problem. For incomplete data, we introduce the constrained EM (CEM) by
adding constraints to the M step, and iteratively solve the learning problem. In
addition, we provide closed form solutions to both CML and CEM.
2  Related  Work


We have already discussed that one of the shortcomings of the EM algorithm is
that it can easily be trapped in a local maximum when data are MCAR. There are
many different methods to help EM escape from a local maximum, such as the
information-bottleneck EM algorithm [4], the data perturbing method [5], and the
AI&M procedure [8]. These methods focus on improving the machine learning
techniques, but ignore useful domain knowledge.

Domain knowledge can be classified as quantitative or qualitative knowledge,
which describe the explicit quantification of parameters and approximate
characterizations of parameters, respectively. Both kinds of domain knowledge
are useful for parameter learning. While quantitative knowledge has been widely
used in the form of prior probability distributions, qualitative constraints
have not yet been fully exploited in parameter learning.

Wittig et al. [12] present a method to integrate qualitative constraints into
two learning algorithms, APN [9] and EM, by adding violation functions as a
penalty term to the log likelihood function. They show that domain knowledge in
the form of constraints can improve learning accuracy. However, this
penalty-based method cannot guarantee finding the global maximum. Besides, the
weights for the penalty functions often need to be manually tuned, depending on
the application. Altendorf et al. [1] describe a method to incorporate
monotonicity constraints into the learning algorithm. It is based on the
assumption that the values of the variables can be totally ordered.
Additionally, it also uses penalty functions, and so suffers from the same
problem as [12]. Feelders and Van der Gaag [6] incorporate some simple
inequality constraints in the learning process. They assume that all the
variables are binary. The constraints used in the above methods [1, 12, 6] are
restrictive, as each constraint has to involve all parameters in a conditional
probability table (CPT).

Campos and Cozman [2] formulate the learning problem as a constrained
optimization problem. However, they do not provide a specific method to solve
the optimization problem. Niculescu et al. [11] also solve the learning problem
by optimization techniques. They derive closed form solutions with ML estimation
for two kinds of constraints: inequalities between sums of parameters and upper
bounds on sums of parameters within a CPT. Their method has two main
limitations. First, they assume that each parameter can appear in one and only
one constraint, and that there is no overlap between the parameters of different
constraints. Second, their method cannot handle constraints that span different
CPTs. We improve on their method by deriving the closed form solution for range
constraints, which contain both upper bound and lower bound constraints for the
same parameters. In addition, the relationship constraints defined in our paper
can be either within or between CPTs.


3  Problem  Definition  and  Approach 

3.1  Basic  Parameter  Learning  Theory 

We focus on parameter learning in a Bayesian network with all discrete nodes,
where the structure is known in advance. The method can be extended to other
graphical models, including IDs. The notation is defined as follows. Assume a BN
with $n$ nodes; $\theta$ is the entire vector of parameters, and $\theta_{ijk}$
denotes one of the parameters: $\theta_{ijk} = p(x_i^k \mid pa_i^j)$, where $i$
($i = 1, \ldots, n$) ranges over all the variables in the BN, $j$
($j = 1, \ldots, q_i$) ranges over all the possible parent configurations of
node (variable) $X_i$, and $k$ ($k = 1, \ldots, r_i$) ranges over all the
possible states of $X_i$. Therefore, $x_i^k$ represents the $k$th state of node
$X_i$, and $pa_i^j$ is the $j$th parent configuration of node $X_i$.

Given a dataset $D = \{D_1, \ldots, D_N\}$, which consists of samples of the BN
nodes, the goal of parameter learning is to find the most probable values
$\hat{\theta}$ for $\theta$ that best explain the dataset $D$, which is usually
quantified by the log likelihood function $\log(p(D \mid \theta))$, denoted as
$L_D(\theta)$. Assuming that the examples are drawn independently from the
underlying distribution, and based on the conditional independence assumptions
in BNs, we have the log likelihood function in Eq.(1), where $n_{ijk}$ is the
count of the cases in which node $i$ is in state $k$ and its parent nodes are in
configuration $j$.

\[ L_D(\theta) = \log \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{\,n_{ijk}} \tag{1} \]
If the dataset $D$ is complete, the ML estimation method can be described as a
constrained optimization problem, i.e., maximize Eq.(2) subject to the equality
constraints in Eq.(3):

\[ \max_{\theta} \; L_D(\theta) \tag{2} \]

\[ \text{s.t.} \quad g_{ij}(\theta) = \sum_{k=1}^{r_i} \theta_{ijk} - 1 = 0 \tag{3} \]

where $g_{ij}$ imposes the constraint that the parameters of node $i$ under
parent configuration $j$ sum to 1 over all states, for $1 \le i \le n$ and
$1 \le j \le q_i$.
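
As an illustration of Eqs.(1)-(3), the unconstrained ML estimate for complete
data is simply the normalized count within each baseline (node, parent
configuration) pair. The following is a minimal sketch, not the authors' code:
the dictionary layout `counts[(i, j)] = [n_ij1, ..., n_ijr]`, the function
names, and the uniform fallback for empty counts are all assumptions made for
the example.

```python
from math import log

def ml_estimate(counts):
    """Return theta[(i, j)][k] = n_ijk / sum_k n_ijk for every (i, j)."""
    theta = {}
    for key, n in counts.items():
        total = sum(n)
        if total == 0:
            # Assumption: with no data for this parent configuration,
            # fall back to a uniform distribution.
            theta[key] = [1.0 / len(n)] * len(n)
        else:
            theta[key] = [c / total for c in n]
    return theta

def log_likelihood(counts, theta):
    """Eq.(1): sum over i, j, k of n_ijk * log(theta_ijk); zero counts are skipped."""
    return sum(c * log(t)
               for key, n in counts.items()
               for c, t in zip(n, theta[key]) if c > 0)

# Hypothetical counts: node 0 has no parents, node 1 has two parent configurations.
counts = {(0, 0): [3, 1], (1, 0): [2, 2], (1, 1): [0, 4]}
theta = ml_estimate(counts)
uniform = {key: [0.5, 0.5] for key in counts}
```

On these counts the ML estimate attains a higher log likelihood than the
uniform parameters, as expected from the optimization in Eqs.(2)-(3).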

If the dataset $D$ is incomplete, ML estimation cannot be applied directly. A
common method is the standard EM algorithm [3], which starts from some initial
point and then iteratively takes an E step and an M step to reach a local
maximum of the likelihood function. For discrete nodes in particular, the E step
computes the expected counts for all parameters, and the M step estimates the
parameters by maximizing the log likelihood function given the counts from the
E step. The EM algorithm is guaranteed to converge to a local maximum. However,
depending on the initialization, it may converge to different local maxima. When
a large amount of data is missing, which means there are many local maxima, the
EM algorithm can get stuck in a local maximum far away from the global one.

3.2  Qualitative  Constraints 

We  introduce  two  kinds  of  qualitative  constraints,  which  can  be  easily  specified  by  domain  experts. 
They  are  range  and  relationship  constraints. 

A range constraint defines the upper bound and lower bound of a parameter.
Assuming $\alpha_{ijk}$ and $\beta_{ijk}$ are the upper bound and lower bound
for parameter $\theta_{ijk}$, the range constraints can be defined as follows:

\[ \beta_{ijk} \le \theta_{ijk} \le \alpha_{ijk} \tag{4} \]

where $0 \le \alpha_{ijk} \le 1$ and $0 \le \beta_{ijk} \le 1$.

A relationship constraint defines the relative relationship between a pair of
parameters. If both parameters in a relationship constraint share the same node
index $i$ and parent configuration $j$, the constraint is called an
intra-relationship constraint, which can be represented as follows:

\[ \theta_{ijk} \le \theta_{ijk'}, \quad \text{where } k \ne k' \tag{5} \]

If the two parameters in a relationship constraint do not satisfy the
requirement of an intra-relationship constraint, the constraint is called an
inter-relationship constraint. It can be described as follows:

\[ \theta_{ijk} \le \theta_{i'j'k'}, \quad \text{where } i \ne i' \text{ or } j \ne j' \tag{6} \]
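
To make the three constraint types in Eqs.(4)-(6) concrete, the sketch below
shows one possible in-memory representation and a feasibility check. The class
names, the `(i, j, k)` index tuples and the `theta[(i, j)][k]` layout are
assumptions chosen for this illustration; they do not come from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RangeConstraint:
    index: tuple   # (i, j, k) of the constrained parameter
    lower: float   # beta_ijk
    upper: float   # alpha_ijk

@dataclass(frozen=True)
class RelationshipConstraint:
    smaller: tuple  # (i, j, k) of the parameter on the left of "<="
    larger: tuple   # (i', j', k') of the parameter on the right of "<="

    @property
    def kind(self):
        # Same node and same parent configuration: intra (Eq.(5)), else inter (Eq.(6)).
        return "intra" if self.smaller[:2] == self.larger[:2] else "inter"

def is_satisfied(constraint, theta, tol=1e-12):
    """Check one constraint against theta, where theta[(i, j)] lists theta_ijk over k."""
    if isinstance(constraint, RangeConstraint):
        i, j, k = constraint.index
        return constraint.lower - tol <= theta[(i, j)][k] <= constraint.upper + tol
    (i, j, k), (i2, j2, k2) = constraint.smaller, constraint.larger
    return theta[(i, j)][k] <= theta[(i2, j2)][k2] + tol

# Hypothetical parameters for two baseline sets.
theta = {(0, 0): [0.7, 0.3], (1, 0): [0.4, 0.6]}
range_c = RangeConstraint((0, 0, 0), lower=0.5, upper=0.9)
intra_c = RelationshipConstraint((0, 0, 1), (0, 0, 0))
inter_c = RelationshipConstraint((0, 0, 0), (1, 0, 1))
```

Here `range_c` and `intra_c` hold for the given `theta`, while `inter_c` is
violated because 0.7 exceeds 0.6.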


3.3  Overview  of  Our  Approach 

We aim to solve the learning problem by reformulating it as a constrained
optimization problem, i.e.,

\[ \begin{aligned}
\max_{\theta} \quad & L_D(\theta) \\
\text{s.t.} \quad & g_{ij}(\theta) = \sum_{k=1}^{r_i} \theta_{ijk} - 1 = 0, \quad 1 \le i \le n \text{ and } 1 \le j \le q_i \\
& h_p(\theta) \le 0, \quad 1 \le p \le S
\end{aligned} \tag{7} \]

where $h_p(\theta) \le 0$ denotes the inequality constraints, and $S$ is the
total number of inequality constraints. Using the Lagrange multipliers
$\lambda_{ij}$ and $\mu_p$, the objective function to be maximized can be
combined with the constraints, producing the following augmented objective
function:

\[ f(\theta) = L_D(\theta) - \sum_{i=1}^{n} \sum_{j=1}^{q_i} \lambda_{ij}\, g_{ij}(\theta) - \sum_{p=1}^{S} \mu_p\, h_p(\theta) \tag{8} \]

Given  Eq.(8),  for  sparse  but  complete  data,  we  can  directly  apply  the  CML  method  by  maxi¬ 
mizing  Eq.(8)  to  estimate  the  parameters.  For  incomplete  data,  we  can  replace  the  M  step  of  EM 
algorithm  by  the  solution  to  Eq.(8),  and  iteratively  obtain  the  estimation  of  the  parameters.  In  the 
section  to  follow,  we  introduce  our  solution  to  Eq.(8). 
4  Parameter  Learning  With  Qualitative  Constraints


In  this  section,  we  derive  the  closed  form  solutions  for  maximizing  Eq.(8)  under  different  types 
of  constraints.  Because  of  the  decomposability  of  the  log  likelihood  function,  we  can  deal  with 
small  independent  optimization  subproblems  on  independent  parameter  sets  separately  instead  of 
dealing  with  all  parameters  simultaneously.  For  this,  we  define  two  kinds  of  parameter  sets:  one  is 
the  baseline  set,  which  contains  parameters  with  the  same  node  and  the  same  parent  configuration; 
the  other  is  the  combined  set ,  which  contains  several  baseline  sets.  We  first  separate  parameters  into 
baseline  sets,  and  then  if  there  is  a  constraint  on  parameters  from  different  baseline  sets,  we  combine 
those  baseline  sets  into  one  new  combined  set.  This  process  continues  until  there  is  no  constraint 
on  parameters  from  different  sets.  After  decomposition  of  parameters,  we  solve  the  constrained 
optimization  subproblems  set  by  set  independently. 
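
The merging of baseline sets into combined sets described above can be sketched
with a union-find (disjoint-set) structure. This is one possible way to
implement the grouping, not necessarily the authors'; the function name and the
representation of a baseline set by its `(i, j)` index are assumptions.

```python
def combine_baseline_sets(baseline_sets, inter_constraints):
    """Group baseline sets (i, j) that are linked by inter-relationship constraints.

    inter_constraints is a list of pairs ((i, j), (i', j')), one per constraint
    whose two parameters come from different baseline sets.
    """
    parent = {s: s for s in baseline_sets}

    def find(s):
        # Follow parent links to the representative, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in inter_constraints:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a   # merge the two sets

    groups = {}
    for s in baseline_sets:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical example: constraints link (0,0)-(1,1) and (1,1)-(2,0).
sets_ = [(0, 0), (1, 0), (1, 1), (2, 0)]
links = [((0, 0), (1, 1)), ((1, 1), (2, 0))]
combined = combine_baseline_sets(sets_, links)
```

Each resulting group is either a single baseline set or a combined set, and the
subproblems for different groups can then be solved independently.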

Specifically, let $Q$ denote a parameter set. Since parameters from one baseline
set share the same node $i$ and the same parent configuration $j$, we use
$(i, j)$ to denote the index of a baseline set. A baseline set can be denoted as
$Q = \{(i, j)\}$, while a combined set, which consists of several baseline sets,
can be denoted as $Q = \{(i, j), (i', j'), \ldots\}$.

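The parameter learning problem can be decomposed into subproblems, one for each
set of parameters. A subproblem can be formulated as follows:

\[ \begin{aligned}
\max_{\theta} \quad & l_D(\theta) = \log \prod_{(i,j) \in Q} \prod_{k=1}^{r_i} \theta_{ijk}^{\,n_{ijk}} \\
\text{s.t.} \quad & g_{ij}(\theta) = \sum_{k=1}^{r_i} \theta_{ijk} - 1 = 0 \quad \text{for } (i,j) \in Q \\
& h_p(\theta) \le 0 \quad \text{for } 1 \le p \le S_Q
\end{aligned} \tag{9} \]

where $g_{ij}$ represents an equality constraint, $h_p$ represents an inequality
constraint, and $S_Q$ is the number of inequality constraints in set $Q$.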


Since the log likelihood function is concave and the qualitative constraints are
linear, the Karush-Kuhn-Tucker (KKT) conditions [10] are sufficient to determine
the solution to Eq.(9). The KKT conditions for the problem described in Eq.(9)
are:

\[ \nabla_{\theta} \Big[ l_D(\theta) - \sum_{(i,j) \in Q} \lambda_{ij}\, g_{ij}(\theta) - \sum_{p=1}^{S_Q} \mu_p\, h_p(\theta) \Big] = 0 \tag{10} \]

\[ \begin{aligned}
g_{ij}(\theta) &= 0, && \text{for } (i,j) \in Q \\
h_p(\theta) &\le 0, && \text{for } 1 \le p \le S_Q \\
\mu_p &\ge 0, && \text{for } 1 \le p \le S_Q \\
\mu_p\, h_p(\theta) &= 0, && \text{for } 1 \le p \le S_Q
\end{aligned} \tag{11} \]

In optimization, an inequality constraint $h_p \le 0$ is active if $h_p = 0$,
and inactive if $h_p < 0$. Based on this definition, we derive closed form
solutions for each type of constraint.


4.1  Range  Constraints 

Since range constraints (Eq.(4)) are applied to individual parameters, we can
solve the subproblems with range constraints within baseline sets. There are two
constraints for each parameter $\theta_{ijk}$ in a baseline set $Q = \{(i,j)\}$:
$h_k^{\alpha}(\theta) = \theta_{ijk} - \alpha_{ijk} \le 0$ (upper bound
constraint), and $h_k^{\beta}(\theta) = \beta_{ijk} - \theta_{ijk} \le 0$ (lower
bound constraint).

As the objective function is concave and the range constraints are linear, the
maximum either lies inside the feasible region defined by all constraints, when
no constraint is active, or on the boundary defined by the active constraints,
when some of the constraints are active. Assuming $K_Q^{\beta}$ and
$K_Q^{\alpha}$ are the sets of active constraints for the lower bounds and upper
bounds of parameters in $Q$ respectively, and
$K_Q = K_Q^{\beta} \cup K_Q^{\alpha}$ represents the set of all active
constraints of parameters in $Q$, the closed form solution for $\theta_{ijk}$
is as follows:

\[ \theta_{ijk} = \begin{cases}
\beta_{ijk} & \text{if } k \in K_Q^{\beta} \\
\alpha_{ijk} & \text{if } k \in K_Q^{\alpha} \\
\Big( 1 - \sum_{k \in K_Q^{\beta}} \beta_{ijk} - \sum_{k \in K_Q^{\alpha}} \alpha_{ijk} \Big) \dfrac{n_{ijk}}{\sum_{k \notin K_Q} n_{ijk}} & \text{otherwise}
\end{cases} \tag{12} \]
Table 1: Search algorithm for finding active range constraints

Step 1: Check the consistency of the range constraints:
        $0 \le \alpha_{ijk} \le 1$, $0 \le \beta_{ijk} \le 1$,
        $\alpha_{ijk} \ge \beta_{ijk}$, $\sum_{k=1}^{r_i} \beta_{ijk} \le 1$,
        and $\sum_{k=1}^{r_i} \alpha_{ijk} \ge 1$, for $1 \le k \le r_i$.
        If satisfied, continue; else change the constraints.

Step 2: If $\sum_{k=1}^{r_i} \alpha_{ijk} = 1$, all the upper bound constraints
        should be active; else if $\sum_{k=1}^{r_i} \beta_{ijk} = 1$, all the
        lower bound constraints should be active. Else, continue.

Step 3: Perform the ML estimation of parameters without constraints. Check the
        constraints with the estimated parameters
        $\theta^*_{ijk} = n_{ijk} / \sum_{k=1}^{r_i} n_{ijk}$. If no constraint
        is violated, then there is no active range constraint. Else, continue.

Step 4: List all possible combinations of active constraints, and remove a
        combination if it contains more than $r_i - 1$ active constraints or if
        $\sum_{k \in K_Q^{\beta}} \beta_{ijk} + \sum_{k \in K_Q^{\alpha}} \alpha_{ijk} > 1$.

Step 5: For each of the remaining combinations, compute $\lambda_{ij}$, until
        finding a $\lambda_{ij}$ satisfying the criteria in Eq.(13).


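The derivation is as follows. From the first equation of the KKT conditions
(Eqs.(10)-(11)), we obtain
$\theta_{ijk} = \dfrac{n_{ijk}}{\lambda_{ij} + \mu_k^{\alpha} - \mu_k^{\beta}}$.
Because $\theta_{ijk}$ cannot sit at its upper bound $\alpha_{ijk}$ and at its
lower bound $\beta_{ijk}$ at the same time, at most one of the upper bound
constraint $h_k^{\alpha}$ and the lower bound constraint $h_k^{\beta}$ for a
parameter $\theta_{ijk}$ can be active at a time. Based on whether there is an
active constraint for $\theta_{ijk}$, two cases are considered.

• Case 1: If one of the upper bound and lower bound constraints is active, then
$\theta_{ijk} = \alpha_{ijk}$ when $h_k^{\alpha}(\theta) = 0$, and
$\theta_{ijk} = \beta_{ijk}$ when $h_k^{\beta}(\theta) = 0$.

• Case 2: If the range constraints are not active, then
$h_k^{\alpha}(\theta) < 0$, $h_k^{\beta}(\theta) < 0$ and
$\mu_k^{\alpha} = \mu_k^{\beta} = 0$. Hence
$\theta_{ijk} = \dfrac{n_{ijk}}{\lambda_{ij}}$. Summing up over all parameters
whose constraints are not active, we get:

\[ 1 - \sum_{k \in K_Q^{\beta}} \beta_{ijk} - \sum_{k \in K_Q^{\alpha}} \alpha_{ijk} = \sum_{k \notin K_Q} \theta_{ijk} = \frac{\sum_{k \notin K_Q} n_{ijk}}{\lambda_{ij}} \]

Thus, we can obtain

\[ \lambda_{ij} = \frac{\sum_{k \notin K_Q} n_{ijk}}{1 - \sum_{k \in K_Q^{\beta}} \beta_{ijk} - \sum_{k \in K_Q^{\alpha}} \alpha_{ijk}}, \qquad
\theta_{ijk} = \Big( 1 - \sum_{k \in K_Q^{\beta}} \beta_{ijk} - \sum_{k \in K_Q^{\alpha}} \alpha_{ijk} \Big) \frac{n_{ijk}}{\sum_{k \notin K_Q} n_{ijk}}, \]

as shown in Eq.(12).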

In this way, we derive the closed form solution for range constraints. To obtain
the solution in Eq.(12), we need to identify the active constraints. Table 1
summarizes the algorithm to find active range constraints. The main idea of this
algorithm is to search for the active constraints using the criteria in Eq.(13).
Due to the page limit, we do not provide the proof for this equation.

\[ \begin{cases}
\lambda_{ij} \le \dfrac{n_{ijk}}{\alpha_{ijk}} & k \in K_Q^{\alpha} \\[2ex]
\lambda_{ij} \ge \dfrac{n_{ijk}}{\beta_{ijk}} & k \in K_Q^{\beta} \\[2ex]
\dfrac{n_{ijk}}{\alpha_{ijk}} \le \lambda_{ij} \le \dfrac{n_{ijk}}{\beta_{ijk}} & \text{otherwise}
\end{cases} \tag{13} \]
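
The search in Table 1 together with Eqs.(12)-(13) can be sketched for a single
baseline set as follows. This is a simplified illustration under stated
assumptions, not the authors' implementation: the function names are
hypothetical, the active sets are enumerated exhaustively (3^r combinations, so
only suitable for small r), the criteria of Eq.(13) are checked in a
multiplied-out form to avoid dividing by a zero bound, and combinations that
leave no probability mass or no counts for the free parameters are skipped.

```python
from itertools import product

def _criteria_hold(state, lam, n_k, b, a, tol):
    """Eq.(13), multiplied through by the bound so that zero bounds are safe."""
    if state == "U":                      # upper bound active
        return lam * a <= n_k + tol
    if state == "L":                      # lower bound active
        return lam * b >= n_k - tol
    return lam * b - tol <= n_k <= lam * a + tol   # no active constraint

def range_constrained_ml(n, beta, alpha, tol=1e-9):
    """Maximize sum_k n[k]*log(theta[k]) s.t. sum(theta) = 1, beta <= theta <= alpha."""
    r = len(n)
    # Step 1: consistency of the range constraints.
    if not all(0 <= b <= a <= 1 for b, a in zip(beta, alpha)):
        raise ValueError("each bound pair must satisfy 0 <= beta <= alpha <= 1")
    if sum(beta) > 1 + tol or sum(alpha) < 1 - tol:
        raise ValueError("the bounds leave no feasible distribution")
    # Step 2: bounds that already sum to one pin every parameter.
    if abs(sum(alpha) - 1) <= tol:
        return list(alpha)
    if abs(sum(beta) - 1) <= tol:
        return list(beta)
    # Step 3: unconstrained ML estimate; finished if it violates nothing.
    total = sum(n)
    if total > 0:
        ml = [c / total for c in n]
        if all(b - tol <= t <= a + tol for t, b, a in zip(ml, beta, alpha)):
            return ml
    # Steps 4-5: try active sets ("F" free, "L" at lower bound, "U" at upper bound).
    for states in product("FLU", repeat=r):
        free = [k for k in range(r) if states[k] == "F"]
        fixed = sum(beta[k] if states[k] == "L" else alpha[k]
                    for k in range(r) if states[k] != "F")
        mass = 1 - fixed                  # probability left for the free parameters
        n_free = sum(n[k] for k in free)
        if not free or mass <= tol or n_free <= 0:
            continue                      # simplification: skip degenerate combinations
        lam = n_free / mass               # Lagrange multiplier lambda_ij
        if all(_criteria_hold(states[k], lam, n[k], beta[k], alpha[k], tol)
               for k in range(r)):
            # Eq.(12): bounds for active constraints, rescaled counts otherwise.
            return [beta[k] if states[k] == "L" else
                    alpha[k] if states[k] == "U" else
                    mass * n[k] / n_free for k in range(r)]
    raise ValueError("no active set satisfied the criteria of Eq.(13)")

upper_case = range_constrained_ml([8, 1, 1], [0, 0, 0], [0.5, 1, 1])
lower_case = range_constrained_ml([0, 5, 5], [0.2, 0, 0], [1, 1, 1])
```

In the first call the ML estimate 0.8 for the first state violates its upper
bound 0.5, so that bound becomes active and the remaining mass is shared in
proportion to the counts. In the second call the lower bound 0.2 becomes active
for a state that was never observed.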


4.2  Intra-Relationship  Constraints 

An intra-relationship constraint defines the relationship between two parameters
within one baseline set. Assume that the parameters within one baseline set
$Q = \{(i,j)\}$ are $\theta_{ij1}, \ldots, \theta_{ijr_i}$, which can be
partitioned into $Q = A \cup B \cup C$, where
$A = \{a_p \mid p = 1, 2, \ldots, S_Q\}$ and
$B = \{b_p \mid p = 1, 2, \ldots, S_Q\}$ are such that
$h_p(\theta) = \theta_{ija_p} - \theta_{ijb_p} \le 0$ for $1 \le p \le S_Q$, and
$C$ is the set of parameters without intra-relationship constraints. The closed
form solution for parameter $\theta_{ijk}$ is then as follows:

\[ \theta_{ijk} = \begin{cases}
\dfrac{n_{ija_p} + n_{ijb_p}}{2 N_{ij}} & \text{if } k = a_p \text{ or } b_p \text{, and } n_{ija_p} > n_{ijb_p} \\[2ex]
\dfrac{n_{ijk}}{N_{ij}} & \text{otherwise}
\end{cases} \tag{14} \]

where $N_{ij} = \sum_{k=1}^{r_i} n_{ijk}$. The derivation is similar to the one
in Niculescu et al. [11].
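
The closed form in Eq.(14) can be sketched in a few lines. The function name and
the representation of constraints as index pairs `(a_p, b_p)` are assumptions
for this example; as in the partition $A \cup B \cup C$ above, each state index
is assumed to appear in at most one constraint.

```python
def intra_constrained_ml(n, pairs):
    """Eq.(14) for one baseline set.

    n      counts n_ijk over the states k of the baseline set
    pairs  list of (a_p, b_p), each meaning theta[a_p] <= theta[b_p]
    """
    used = [k for pair in pairs for k in pair]
    if len(used) != len(set(used)):
        raise ValueError("each state may appear in at most one constraint")
    total = sum(n)                        # N_ij
    theta = [c / total for c in n]        # unconstrained ML estimate
    for a, b in pairs:
        if n[a] > n[b]:
            # Constraint is violated by the ML estimate, so it becomes active:
            # both parameters take the average of their ML values.
            theta[a] = theta[b] = (n[a] + n[b]) / (2 * total)
    return theta

# Hypothetical counts with one constraint theta[0] <= theta[1].
active = intra_constrained_ml([6, 2, 2], [(0, 1)])
inactive = intra_constrained_ml([2, 6, 2], [(0, 1)])
```

When the counts already respect the ordering, as in the second call, the result
is just the ML estimate.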


4.3  Inter-Relationship  Constraints 


An inter-relationship constraint defines the constraint applied to two
parameters $\theta_{i'j'a}$ and $\theta_{i''j''b}$ from different baseline sets
$Q_A$ and $Q_B$. Thus the subproblem for parameters with an inter-relationship
constraint is defined on a combined parameter set $Q = Q_A \cup Q_B$, where
baseline set $Q_A = \{(i', j')\}$ and baseline set $Q_B = \{(i'', j'')\}$, such
that $h(\theta) = \theta_{i'j'a} - \theta_{i''j''b} \le 0$. Let
$N_A = \sum_{k=1}^{r_{i'}} n_{i'j'k}$, $N_B = \sum_{k=1}^{r_{i''}} n_{i''j''k}$,
$n_a = n_{i'j'a}$, and $n_b = n_{i''j''b}$. The closed form solution for
parameters with an inter-relationship constraint is as follows.

If $n_a N_B - N_A n_b > 0$:

\[ \theta_{ijk} = \begin{cases}
\dfrac{n_a + n_b}{N_A + N_B} & ijk = i'j'a \text{ or } i''j''b \\[2ex]
\Big( 1 - \dfrac{n_a + n_b}{N_A + N_B} \Big) \dfrac{n_{ijk}}{N_A - n_a} & (i,j) \in Q_A \text{ and } k \ne a \\[2ex]
\Big( 1 - \dfrac{n_a + n_b}{N_A + N_B} \Big) \dfrac{n_{ijk}}{N_B - n_b} & (i,j) \in Q_B \text{ and } k \ne b
\end{cases} \tag{15} \]

Else:

\[ \theta_{ijk} = \begin{cases}
\dfrac{n_{ijk}}{N_A} & (i,j) \in Q_A \\[2ex]
\dfrac{n_{ijk}}{N_B} & (i,j) \in Q_B
\end{cases} \tag{16} \]


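The brief derivation of the solution is as follows. The KKT conditions are:

\[ \nabla_{\theta} \big[ l_D(\theta) - \lambda_A\, g_A(\theta) - \lambda_B\, g_B(\theta) - \mu\, h(\theta) \big] = 0 \tag{17} \]

\[ g_A(\theta) = 0, \quad g_B(\theta) = 0, \quad h(\theta) \le 0, \quad \mu \ge 0, \quad \mu\, h(\theta) = 0 \]

From the first equation of the KKT conditions (Eq.(17)), we can obtain:

\[ \theta_{ijk} = \begin{cases}
\dfrac{n_{ijk}}{\lambda_A + \mu} & ijk = i'j'a \\[2ex]
\dfrac{n_{ijk}}{\lambda_B - \mu} & ijk = i''j''b \\[2ex]
\dfrac{n_{ijk}}{\lambda_A} & (i,j) \in Q_A \text{ and } k \ne a \\[2ex]
\dfrac{n_{ijk}}{\lambda_B} & (i,j) \in Q_B \text{ and } k \ne b
\end{cases} \tag{18, 19} \]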


Two cases are considered, depending on whether the inter-relationship constraint
is active or not:

• Case 1: $h(\theta) = 0$ and $\mu > 0$

We can solve for $\lambda_A$, $\lambda_B$, and $\mu$ with the following
equations:

\[ \begin{aligned}
\frac{n_a}{\lambda_A + \mu} &= \frac{n_b}{\lambda_B - \mu} = \frac{n_a + n_b}{\lambda_A + \lambda_B} \\
\frac{n_a}{\lambda_A + \mu} + \frac{N_A - n_a}{\lambda_A} &= 1 \\
\frac{n_b}{\lambda_B - \mu} + \frac{N_B - n_b}{\lambda_B} &= 1
\end{aligned} \tag{20} \]

The first equation is $h(\theta) = 0$; the second and the third are from
$g_A(\theta) = 0$ and $g_B(\theta) = 0$. Also, from $\mu > 0$, we can get
$n_a N_B - N_A n_b > 0$. In this way, we obtain the first part of the closed
form solution (Eq.(15)).

• Case 2: $h(\theta) < 0$ and $\mu = 0$

This is equivalent to the case in which no inequality constraints are applied.
From $g_A(\theta) = 0$, we can get
$\lambda_A = \sum_{(i,j) \in Q_A} \sum_{k} n_{ijk} = N_A$. Similarly, we can get
$\lambda_B = N_B$. Plugging them into Eq.(18), we obtain the second part of the
closed form solution (Eq.(16)). From $h(\theta) < 0$, we get
$n_a N_B - N_A n_b < 0$.
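
The two-branch solution in Eqs.(15)-(16) can be sketched as follows for a single
inter-relationship constraint $\theta_{i'j'a} \le \theta_{i''j''b}$. The
function name and argument layout are assumptions made for this example, and the
guard for the degenerate case $N_A = n_a$ (where Eq.(15) would divide by zero)
is an addition that is not discussed in the text.

```python
def inter_constrained_ml(n_A, a, n_B, b):
    """Closed form for theta_A[a] <= theta_B[b] across two baseline sets.

    n_A, n_B  counts over the states of baseline sets Q_A and Q_B
    a, b      state indices of the two constrained parameters
    """
    N_A, N_B = sum(n_A), sum(n_B)
    n_a, n_b = n_A[a], n_B[b]
    if n_a * N_B - N_A * n_b > 0:
        # Constraint is active: both constrained parameters share one value.
        if N_A == n_a:
            raise ValueError("degenerate counts: all of Q_A's mass is on state a")
        shared = (n_a + n_b) / (N_A + N_B)
        theta_A = [shared if k == a else (1 - shared) * c / (N_A - n_a)
                   for k, c in enumerate(n_A)]
        theta_B = [shared if k == b else (1 - shared) * c / (N_B - n_b)
                   for k, c in enumerate(n_B)]
    else:
        # Constraint is inactive: plain ML estimates within each baseline set.
        theta_A = [c / N_A for c in n_A]
        theta_B = [c / N_B for c in n_B]
    return theta_A, theta_B

# Hypothetical counts: ML gives 0.8 > 0.3, so the constraint becomes active.
act_A, act_B = inter_constrained_ml([8, 2], 0, [3, 7], 0)
# Here ML gives 0.2 <= 0.7, so the constraint is inactive.
ina_A, ina_B = inter_constrained_ml([2, 8], 0, [7, 3], 0)
```

In the active case both constrained parameters equal
$(n_a + n_b)/(N_A + N_B) = 11/20$, and the remaining mass in each baseline set
is redistributed in proportion to the other counts.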
5  Evaluation  with  Synthetic  Data


In order to test the performance of our method against ML estimation and the
standard EM algorithm given sparse data and incomplete data respectively, we
test the algorithms on multiple BNs with 20 nodes each, but with different
randomly generated initial parameters and structures. For one specific BN
structure, 11 BNs with different initializations of parameters are generated.
One of them is treated as the ground truth, and the 10 others as different
initializations for parameter learning. For the case of sparse data, 700 samples
are generated from the ground truth BN, 200 for testing and the remaining 500
for training. For the case of incomplete data, 400 samples are drawn from the
ground truth BN, half for training and half for testing, and all training data
associated with hidden nodes are removed. To produce the needed constraints, for
the case of sparse data, we randomly choose a subset of all parameters and
impose constraints on the selected parameters. For the case of incomplete data,
we randomly choose parameters only from those of the hidden nodes, and impose
constraints on them. The number of constraints in a CPT is no more than 2. For
performance characterization, the Kullback-Leibler (K-L) divergence is used,
which measures the distance between the learned parameters and the ground truth.
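
For reference, the K-L divergence between a ground-truth distribution and a
learned one can be computed as in the sketch below. The text does not specify
how the divergence is aggregated over parent configurations or nodes, so the
unweighted mean over parent configurations used here is an assumption, as are
the function names and the small `eps` floor that avoids taking the log of zero.

```python
from math import log

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_k p_k * log(p_k / q_k), with q floored at eps."""
    return sum(pk * log(pk / max(qk, eps)) for pk, qk in zip(p, q) if pk > 0)

def mean_cpt_kl(true_cpt, learned_cpt):
    """Average KL over the parent configurations of one node (an assumed aggregation).

    Each CPT is a list with one probability vector per parent configuration.
    """
    values = [kl_divergence(p, q) for p, q in zip(true_cpt, learned_cpt)]
    return sum(values) / len(values)

# Hypothetical CPTs for a binary node with two parent configurations.
truth = [[0.5, 0.5], [0.9, 0.1]]
learned = [[0.25, 0.75], [0.9, 0.1]]
score = mean_cpt_kl(truth, learned)
```

A score of zero means that the learned CPT matches the ground truth exactly;
larger values indicate a worse fit.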

With  complete  but  sparse  data,  we  compare  the  learning  performance  of  ML  estimation  with 
our  method  with  range  constraints,  intra-relationship  constraints  and  inter-relationship  constraints 
respectively,  as  shown  in  Figure  1.  We  can  see  that  CML  is  better  than  ML  estimation  in  both 
mean  and  standard  deviation  of  KL-divergence.  More  specifically,  the  mean  K-L  divergence  for  ML 
estimation  is  0.2087,  which  decreases  to  0.0786  for  CML  with  range  constraints,  0.1763  for  CML 
with  intra-relationship  constraints,  and  0.1546  for  CML  with  inter-relationship  constraints. 


Figure 1: Sparse Data Learning Results Comparisons w.r.t. K-L divergence: ML
estimation vs. CML. (a) range constraints; (b) intra-relationship constraints;
(c) inter-relationship constraints.


With  incomplete  data,  we  compare  the  learning  performance  of  our  method  with  standard  EM 
method  as  shown  in  Figure  2.  The  average  K-L  divergence  of  hidden  nodes  decreases  from  0.6437 
for  EM  to  0.2361  for  CEM  with  range  constraints,  0.3830  for  CEM  with  intra-relationship  con¬ 
straints,  and  0.4864  for  CEM  with  inter-relationship  constraints.  The  improvements  are  especially 
significant  for  the  hidden  nodes  (nodes  13  to  20). 


[Figure 2 plots. Panel titles: Range Constraints; Intra Relationship
Constraints; Inter Relationship Constraints.]


Figure 2: Incomplete Data Learning Results Comparisons w.r.t. K-L divergence:
EM vs. CEM. (a) range constraints; (b) intra-relationship constraints;
(c) inter-relationship constraints.
6  Conclusion


Qualitative domain knowledge is available in many applications. We define two
types of constraints to represent such knowledge, and derive a closed-form
solution for the maximum likelihood parameter estimator under each of the two
types of constraints. For sparse data, we directly apply our constrained
maximum likelihood estimator; for incomplete data, we extend the EM method by
replacing the M step with our constrained maximum likelihood estimator. The
experimental results on synthetic data demonstrate that our method can exploit
the domain knowledge to improve parameter learning accuracy.

Abstract


This report summarizes our preliminary work on evaluating the
effectiveness of the Influence Diagram (ID) for modeling and evaluating
EBO-based military planning. It includes the construction of an ID model,
its parameterization, and the evaluation of the model using different
action-selection strategies.


1  Introduction 

A military plan includes various military actions, the effects of these actions,
as well as sensory observations that assess the effectiveness of the actions.
In addition, a military plan also includes its goal as well as the subgoals that
are needed to achieve the goal. For effects-based military planning, we need a
model that can represent different factors, their uncertainties, and their
dependencies. In addition, the model should have a mechanism that can propagate
the action effects, their uncertainties, and their dynamics. To meet these
requirements, a probabilistic framework based on the dynamic Influence Diagram
(DID) is used.

An Influence Diagram (ID) is a directed acyclic graph that consists of
random nodes, decision nodes, and value (utility) nodes. Here, the sensors
and the actions are represented by the decision nodes. The action effects,
the goal and subgoals, and the observations are represented by the random
nodes. The utility of a sensor or an action is represented by the utility nodes.
Figure 1 shows a generic ID for EBO-based planning. Figure 2 shows
the application of the model to a specific military planning problem.

2  Model  Description 

2.1  Structure  of  the  Model 

Figure 3 shows an example ID structure for a generic military planning problem.
The planning problem consists of 4 military actions, a goal, and several
subgoals. In total, there are 33 nodes in this ID, and all the random nodes
and decision nodes are binary. Specifically, circular nodes represent the random
variables, rectangular nodes are the decision variables, and the diamond
nodes are the utility variables. The G node depicts the goal that the military
plan is to achieve. The G0 node is the goal node from the previous time slice.
SG and SSG represent the subgoals and sub-subgoals that are needed to achieve
the goal. The action effects are represented by the E nodes. The A nodes
represent the military actions, and the sensing operations that assess the
military action effects are represented by the S nodes. The sensory
observations that result from the sensing operations are represented by the O
nodes. Finally, the diamond U nodes represent the action utilities.

Figure 1: A generic influence diagram for EBO modeling. Legend: Observation:
assessment of an effect and related contextual information, e.g., weather,
terrain, intelligence. Sensing: an operation to acquire the evidence, such as
aerial reconnaissance or ground surveillance.

2.2  Parameters  of  the  Model 

Given the topology of the model, the next task is to parameterize the model.
Model parameterization requires specifying the conditional probabilities of
each node given each configuration of its parents. It can be done
automatically through a learning method or manually by a domain expert.
Here, we manually set the parameters.

Figure 2: An example DID for EBO-based military plan modeling and assessment,
where the circular nodes at the bottom represent various actions
and the rectangular nodes the action effects.

Specifically, Table 1 summarizes the conditional probability tables (CPTs)
of the model shown in Fig. 3. Table 1 (a) provides the CPTs for the effect
nodes E, Table 1 (b) provides the CPTs for the observation nodes O, and
Table 1 (c) consists of the CPTs for the goal node. Each node has two values,
with 1 being false and 2 being true. In addition, a random number r
between -0.05 and 0.05 is added to the probabilities to vary the effects of
different actions.

We also need to specify the parameters for the utility nodes. Table 2 shows
the utility functions associated with the action and sensor nodes. The utility
of a military action is measured by its contribution to achieving the military
goal or subgoal, while the utility of a sensing action is measured by the
tradeoff between its cost and its contribution to understanding the effect of
an action. These utility values vary from action to action and also over time.
For this study, their values are randomly assigned.

Figure 3: A Dynamic Influence Diagram model for generic military planning.

The parameters of all the nodes are assigned based on the following heuristic
rules:

1. If there is no action, then there is no effect. Thus, when an A node
equals false, the probability of the corresponding E node being true
is zero.

2. We assume that the sensors can effectively detect the effect of an action,
which means that if there is an effect, the probability of the observation
being true is 0.7.

3. When both subgoals are accomplished, the chance of the goal node
being true is high. On the other hand, if only one of the subgoals
is achieved, the chance of the goal node being true is low. As the two
subgoals may be of different importance in accomplishing the goal, the
probability of the goal given each subgoal is, therefore, different.

4. After every time slice, the probability of the goal node at the next time
slice is initialized to the value of the G0 node. At the initial time
slice, the probability of the G0 node being true is 0.5.
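
As an illustration of rules 1 and 2, the effect and observation CPTs of Table 1 could be built as in the following minimal Python sketch. It assumes that r is drawn uniformly from [-0.05, 0.05] (the text only gives the range), and the function names `effect_cpt` and `observation_cpt` are hypothetical, not part of the original implementation.

```python
import random

def effect_cpt(r):
    """P(E | A) as cpt[a][e], with 1 = false and 2 = true (hypothetical helper)."""
    p_true = 0.7 + r  # Table 1 (a): P(E = 2 | A = 2) = 0.7 + r
    return {
        1: {1: 1.0, 2: 0.0},              # rule 1: no action, no effect
        2: {1: 1.0 - p_true, 2: p_true},
    }

def observation_cpt():
    """P(O | E) as cpt[e][o], taken from Table 1 (b)."""
    return {
        1: {1: 0.8, 2: 0.2},
        2: {1: 0.3, 2: 0.7},              # rule 2: an effect is observed w.p. 0.7
    }

# Assumption: r is uniform on [-0.05, 0.05]; one draw per action node.
rng = random.Random(0)
effect_cpts = [effect_cpt(rng.uniform(-0.05, 0.05)) for _ in range(4)]
```

Each row of every table sums to one by construction, so the perturbation r only shifts probability mass between the two states of an effect node.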


3  Model  Evaluation 

Figure 4 provides the flowchart that we use to evaluate the model. The process
is iterative. First, we initialize the prior probability of goal attainment to
0.5, and then iteratively choose the best plan until the probability of
achieving the goal is high enough or the net expected utility of the selected
plan is too low. The two thresholds can be set according to the real conditions.

During each iteration, if the probability of achieving the goal is below a
threshold, we proceed to identify a military plan that can maximally improve
the chance of goal attainment. Once the plan is identified, the actions
of the plan are executed, and the action effects are propagated. This is
followed by activating the corresponding sensory operations to assess
the effects of the actions. The acquired sensory observations are propagated
through the model to update its joint probabilities. This propagation also
updates the probability of the goal node. The process then repeats. In
summary, the process includes the following steps:

• The first step is to infer the probability of the goal node to decide
whether to take further actions or to stop.

• If we did not stop in the first step, we identify the best plan, execute
its constituent actions, and propagate their effects.

Table 1: Conditional Probability Tables: (a) effect nodes; (b) observation
nodes; (c) the goal node

(a) P(E_i | A_i), i = 1, 2, 3, 4

            E_i = 1                      E_i = 2
A_i = 1     1                            0
A_i = 2     1 - P(E_i = 2 | A_i = 2)     0.7 + r

(b) P(O | E)

          O = 1    O = 2
E = 1     0.8      0.2
E = 2     0.3      0.7


Table 2: Utility functions: (a) and (b) action utilities with respect to goal;
(c) costs of sensing operations.


Figure 4: Flowchart of the algorithm


• The third step is to obtain the observations, which provide an assessment
of the results of action effects, and to propagate the observations
to update the goal attainment probability.
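
The three steps above can be sketched as a simple loop. This is only an illustrative Python outline under stated assumptions: `select_plan` and `execute_and_observe` are hypothetical callbacks standing in for plan selection and for ID inference with sensing, and the threshold values and `max_slices` guard are placeholders rather than values from the report.

```python
def run_evaluation(select_plan, execute_and_observe, p_goal=0.5,
                   goal_threshold=0.9, utility_threshold=0.0, max_slices=50):
    """Iterate plan selection until the goal is likely enough or no plan is worthwhile.

    select_plan(p_goal) -> (plan, utility)            # hypothetical callback
    execute_and_observe(plan, p_goal) -> new p_goal   # hypothetical callback
    """
    history = [p_goal]
    for _ in range(max_slices):
        if p_goal >= goal_threshold:        # step 1: goal probability high enough
            break
        plan, utility = select_plan(p_goal)  # step 2: identify the best plan
        if utility < utility_threshold:      # selected plan is not worthwhile
            break
        # steps 2-3: execute the actions, sense, and propagate the observations
        p_goal = execute_and_observe(plan, p_goal)
        history.append(p_goal)
    return history

# Toy stand-ins: every plan closes half of the remaining gap to certainty.
history = run_evaluation(
    select_plan=lambda p: ([1], 1.0),
    execute_and_observe=lambda plan, p: p + 0.5 * (1.0 - p),
)
```

With these toy callbacks the loop stops after three time slices, once the goal probability exceeds the assumed threshold of 0.9.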


4  Action  Selection  Methods 

To form a military plan, we need to identify its constituent actions. Actions
are selected according to one of three kinds of criteria.

1. Only one action is chosen each time. Two algorithms are implemented
here. Algorithm brute force 1 chooses the best action, i.e., the one that
maximizes the net expected utility ($E_S(U)$ in Eq. 1). Algorithm random 1
randomly chooses one action each time.

2.  Obtain  a  plan  consisting  of  a  set  of  actions,  that  maximizes  the  net 
expected  utility  without  considering  action  costs.  Three  algorithms 
are  implemented  here:  brute  force,  greedy,  and  random  selection. 

3.  Obtain  a  plan  to  maximize  the  net  expected  utility  while  the  cost  of  the 
plan  is  under  a  given  budget  limit.  Three  algorithms  are  implemented 
here:  brute  force,  simple  selection,  and  the  random  selection.  All  are 
subject  to  an  upper  budget  limit. 

The expected utility of a plan is defined as

$$E_S = E_S(U) - \sum_{i=1}^{n} C(A_i) \qquad (1)$$

where the first term represents the net expected utility (Eq. 2) of the plan
$S$, which consists of a series of actions. The second term is the cost of the
plan, i.e., the sum of the costs of all the actions involved in the plan; $n$
is the number of actions that can be chosen in a plan.

$$E_S(U) = \sum_{i=1}^{2} P(G = i \mid S)\, U(G = i) \qquad (2)$$
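
To make Eqs. (1) and (2) concrete, the following Python sketch evaluates them for a single plan. The goal distribution, utilities, and action costs are made-up illustrative numbers, and the function names are hypothetical.

```python
def net_expected_utility(p_goal, utility):
    """Eq. (2): sum over goal states i of P(G = i | S) * U(G = i)."""
    return sum(p_goal[i] * utility[i] for i in p_goal)

def expected_utility(p_goal, utility, plan, cost):
    """Eq. (1): net expected utility minus the total cost of the plan's actions."""
    return net_expected_utility(p_goal, utility) - sum(cost[a] for a in plan)

# Illustrative (assumed) numbers: state 1 = goal false, state 2 = goal true.
p_goal = {1: 0.2, 2: 0.8}
utility = {1: 0.0, 2: 100.0}
cost = {1: 5.0, 2: 10.0, 3: 7.0, 4: 3.0}
value = expected_utility(p_goal, utility, plan=[1, 3], cost=cost)
```

Here the net expected utility is 0.8 x 100 = 80 and the plan {1, 3} costs 12, so the expected utility is about 68.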

4.1  One  Action  for  Each  Plan 

4.1.1  Brute  Force  1 

Brute force 1 finds the best action based on the net expected utility. Table
3 summarizes the algorithm.

Table 3: Brute Force Action Selection Method


Step 1: List all the candidate actions.

Step 2: Compute the net expected utility for each action.
Step 3: Find the best action, i.e., the one that maximizes $E_S(U)$.


4.1.2  Random  1 

Random  1  randomly  selects  an  action  from  the  candidate  actions  every  time. 

4.2  Multiple  Actions  for  Each  Plan:  No  Cost 

4.2.1  Brute  Force 

Brute  force  can  find  the  optimal  plan  with  the  largest  net  expected  utility. 
There  is  no  limit  on  the  number  of  actions  in  a  plan.  The  algorithm  is 
described  in  Table  4. 

Table  4:  Brute  Force  Action  Selection  Method 


Step 1: List all the possible plans (combinations of actions).
Step 2: Compute the net expected utility for each combination.
Step 3: Find the best plan, i.e., the one that maximizes $E_S(U)$.
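
The enumeration in Table 4 can be sketched in Python as follows. The callable `net_expected_utility` is a hypothetical stand-in for inference on the ID, and the additive toy utility used in the demonstration is an assumption made only so that the example runs on its own.

```python
from itertools import combinations

def brute_force_plan(actions, net_expected_utility):
    """Evaluate every non-empty combination of actions and keep the best one."""
    best_plan, best_value = (), float("-inf")
    for k in range(1, len(actions) + 1):
        for plan in combinations(actions, k):     # Step 1: list all plans
            value = net_expected_utility(plan)    # Step 2: evaluate each plan
            if value > best_value:                # Step 3: keep the maximizer
                best_plan, best_value = plan, value
    return best_plan, best_value

# Toy stand-in for ID inference (assumed): each action adds a fixed gain.
gains = {1: 3.0, 2: 2.0, 3: -1.0, 4: 0.5}
toy_utility = lambda plan: sum(gains[a] for a in plan)
best_plan, best_value = brute_force_plan([1, 2, 3, 4], toy_utility)
```

For n candidate actions the loop visits 2^n - 1 plans, which is why this method becomes expensive as the number of action nodes grows.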


4.2.2  Multiple  Actions  for  Each  Plan:  Greedy  Method 

For a large model with many action nodes, it takes a long time to find the
globally optimal plan using a brute force method. The greedy method, which
takes less time, is an alternative to the brute force method that efficiently
finds a locally optimal plan. Table 5 summarizes the greedy action selection
algorithm.

Table 5: Greedy Action Selection Method


Step 1: Compute the net expected utility for each solo action.

Step 2: Find the best action, i.e., the one that maximizes the net expected
utility, and add it to a plan.

Step 3: Among the remaining actions, find the action that maximizes the net
expected utility when combined with the chosen actions as a new plan.
Step 4: Repeat this process until the net expected utility of the plan peaks.
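
A minimal Python sketch of the greedy procedure in Table 5 is given below. As before, `net_expected_utility` is a hypothetical callable standing in for ID inference, and the additive toy utility is an assumption used only for the demonstration.

```python
def greedy_plan(actions, net_expected_utility):
    """Grow a plan one action at a time until the utility stops improving."""
    plan, best_value = [], float("-inf")
    remaining = list(actions)
    while remaining:
        # Steps 1-3: pick the action that is best when added to the current plan.
        candidate = max(remaining, key=lambda a: net_expected_utility(plan + [a]))
        value = net_expected_utility(plan + [candidate])
        if value <= best_value:   # Step 4: the utility has peaked
            break
        plan.append(candidate)
        remaining.remove(candidate)
        best_value = value
    return plan, best_value

# Toy stand-in for ID inference (assumed): each action adds a fixed gain.
gains = {1: 3.0, 2: 2.0, 3: -1.0, 4: 0.5}
toy_utility = lambda plan: sum(gains[a] for a in plan)
plan, value = greedy_plan([1, 2, 3, 4], toy_utility)
```

The greedy search needs at most on the order of n^2 utility evaluations for n actions, compared with 2^n - 1 for exhaustive enumeration, at the price of possibly stopping at a local optimum.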


4.2.3  Multiple  Actions  for  Each  Plan:  Random  Selection 

To  compare  with  the  brute  force  and  the  greedy  method,  we  also  implement 
the  random  selection  method.  For  the  random  selection  method,  we  confine 
the  maximum  number  of  actions  selected  by  the  random  algorithm  to  the 
maximum  number  of  actions  selected  by  the  brute  force  algorithm.  Table  6 
summarizes  the  process  of  random  selection. 

Table  6:  Random  Action  Selection  Method 


Step 1: Identify the number of actions selected by the brute force method.
Step 2: Randomly choose the same number of actions to form a plan.


4.3  Multiple  Actions  for  Each  Plan  with  a  Budget  limit 

4.3.1  Brute  Force  with  a  budget  limit 

Brute force with a budget limit, which is described in Table 7, finds a plan
that maximizes the net expected utility subject to a budget limit.
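
The budget-constrained search of Table 7 can be sketched in Python as shown below. The costs, the budget, and the additive toy utility are illustrative assumptions, and `net_expected_utility` is a hypothetical stand-in for ID inference.

```python
from itertools import combinations

def brute_force_with_budget(actions, cost, budget, net_expected_utility):
    """Enumerate all plans, drop those over budget, and keep the best of the rest."""
    best_plan, best_value = (), float("-inf")
    for k in range(1, len(actions) + 1):
        for plan in combinations(actions, k):
            if sum(cost[a] for a in plan) > budget:
                continue                      # plan is beyond the budget limit
            value = net_expected_utility(plan)
            if value > best_value:
                best_plan, best_value = plan, value
    return best_plan, best_value

# Illustrative (assumed) gains, costs, and budget.
gains = {1: 3.0, 2: 2.0, 3: 1.0, 4: 0.5}
cost = {1: 4.0, 2: 3.0, 3: 1.0, 4: 1.0}
toy_utility = lambda plan: sum(gains[a] for a in plan)
best_plan, best_value = brute_force_with_budget([1, 2, 3, 4], cost, 5.0, toy_utility)
```

In this toy instance the plan {1, 3} uses the whole budget of 5 and attains the highest utility among the feasible plans.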

4.3.2  Simple  Method  with  a  budget  limit 

Simple method with a budget limit chooses, in descending order of net expected
utility, the first several actions whose total cost is under the given budget
limit. The algorithm is summarized in Table 8.

Table 7: Brute Force Action Selection Method with a Budget Limit


Step 1: List all the possible plans (combinations of actions).

Step 2: Compute the cost of each plan. If the cost of a plan is
beyond the given budget limit, remove it from the possible plans.

Step 3: Compute the net expected utility for each of the remaining plans.
Step 4: Find the best plan, i.e., the one that maximizes the net expected utility.

Table  8:  Simple  Action  Selection  Method  with  a  Budget  Limit 


Step  1:  Compute  the  net  expected  utility  for  each  solo  action,  and  sort  the 
actions  in  descending  order. 

Step  2:  Choose  the  first  several  actions  whose  sum  of  costs  is  under  the  given 
budget  limit. 
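
Table 8 can be read as the following Python sketch. Two points are assumptions rather than facts from the report: "the first several actions" is interpreted as the longest prefix of the sorted list that fits the budget, and `net_expected_utility` is a hypothetical stand-in for ID inference.

```python
def simple_plan(actions, cost, budget, net_expected_utility):
    """Sort actions by solo utility and take the longest prefix within budget."""
    # Step 1: rank the actions by their solo net expected utility.
    ranked = sorted(actions, key=lambda a: net_expected_utility((a,)), reverse=True)
    plan, total = [], 0.0
    # Step 2: take actions in order until the next one would exceed the budget.
    for a in ranked:
        if total + cost[a] > budget:
            break  # assumption: stop at the first action that does not fit
        plan.append(a)
        total += cost[a]
    return plan

# Illustrative (assumed) gains, costs, and budget.
gains = {1: 3.0, 2: 2.0, 3: 1.0, 4: 0.5}
cost = {1: 4.0, 2: 3.0, 3: 1.0, 4: 1.0}
toy_utility = lambda plan: sum(gains[a] for a in plan)
plan = simple_plan([1, 2, 3, 4], cost, 5.0, toy_utility)
```

On this toy instance the simple method returns only action 1, because the second-ranked action no longer fits, whereas an exhaustive search over the same data would find the better plan {1, 3}. This illustrates the trade between speed and optimality.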


4.3.3  Random  Selection  with  a  budget  limit 

Random  selection  with  a  budget  limit  randomly  selects  a  plan  whose  cost  is 
under  the  budget  limit.  The  algorithm  is  summarized  in  Table  9. 


Table  9:  Random  Action  Selection  Method 


Step  1:  Randomly  choose  several  actions  as  a  plan. 

Step  2:  Check  whether  the  cost  of  the  plan  is  beyond  the  budget  limit.  If  so, 
go  to  step  1.  Otherwise  terminate  the  program  and  output  the  plan. 
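
The loop in Table 9 is a form of rejection sampling, sketched below in Python. The `max_tries` guard is an addition of this sketch (Table 9 repeats indefinitely), and the costs and budget are illustrative assumptions.

```python
import random

def random_plan_with_budget(actions, cost, budget, rng, max_tries=10_000):
    """Draw random plans until one fits the budget (rejection sampling)."""
    for _ in range(max_tries):
        k = rng.randint(1, len(actions))        # Step 1: random plan size
        plan = rng.sample(list(actions), k)     #         and random actions
        if sum(cost[a] for a in plan) <= budget:
            return plan                         # Step 2: accept a feasible plan
    # Assumption of this sketch: give up instead of looping forever.
    raise RuntimeError("no plan within the budget limit was found")

cost = {1: 4.0, 2: 3.0, 3: 1.0, 4: 1.0}
plan = random_plan_with_budget([1, 2, 3, 4], cost, 5.0, random.Random(0))
```

If the budget is tight relative to the action costs, most draws are rejected, so the expected number of iterations can be large.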
5 Experiment Results

To  test  the  various  action  selection  methods  mentioned  above,  we  conducted 
three  experiments.  All  the  experiments  are  based  on  the  model  shown  in  Fig. 
3.  Since  the  observation  nodes  are  instantiated  randomly,  the  probability  of 
goal  attainment  also  varies.  Hence,  we  perform  each  experiment  20  times  to 
obtain  the  average. 

5.1  Experiment  1:  one  action  for  each  plan 

In the first experiment, one action is chosen every time. The experiment
results are shown in Fig. 5. Figure 5 (a) is the net expected utility of the
selected plan in each time slice, and (b) is the corresponding probability of
goal attainment. The midpoint of each bar is the mean value, and the height
of the bar is the standard deviation, both obtained from 20 experimental
runs.

The algorithm marked by ‘best1’ is the brute force 1 algorithm, which chooses
the one best action based on the net expected utility criterion, and the
algorithm marked by ‘random1’ chooses one action randomly. We can see from the
figures that the mean net expected utility of brute force 1 is greater
than that of random 1 in each time slice except for the starting point, where
they are equal.

5.2  Experiment  2:  Multiple  Actions  without  Cost 

In this experiment, we test the brute force, greedy, and random action
selection algorithms. The first two algorithms choose actions based on the net
expected utility. The random selection selects actions randomly. We confine
the maximum number of actions selected by the random algorithm to be
the same as the maximum number of actions selected by the brute force
algorithm.

Figure 6 shows the experiment results. Figure 6 (a) is the net expected
utility in each time slice, and (b) is the corresponding probability of goal
attainment. We can see from the figures that the mean net expected utility
of the plan using the brute force and greedy methods is greater than that
of the random method in each time slice except for the starting point, where
they are all the same. The result of brute force is close to that of the
greedy algorithm, although in some time slices it is slightly better.
Computationally, however, the greedy method is much faster than the brute
force method. Finally, the height of the error bar, which represents the
standard deviation, shows the impact of observation variation on the
experimental results. If the observations are false, the value of a plan may
decrease; if they are true, the performance of a plan may increase.

Figure 5: Experiment 1 Results: (a) Expected utility of the selected plan
at each time instant; (b) Probability of goal attainment after executing the
selected plan at each time instant. [Panel titles: One action once; horizontal
axis: time slice.]


Figure 6: Experiment 2 Results: (a) Expected utility of the selected plan;
(b) Probability of goal attainment after executing the selected plan at each
time instant. [Panel title: Criterion: Net Expected Utility; horizontal axis:
time slice.]


5.3  Experiment  3:  Multiple  Actions  with  Cost 

In this experiment, we test the brute force, simple, and random action
selection methods, all under a budget limit on the cost of the selected plan.
The first two algorithms choose actions to maximize the net expected utility
given a budget limit. The random selection randomly selects a plan under the
given budget limit. Figure 7 shows the experiment results. Similar to the
experiment results above, Figure 7 (a) is the net expected utility of the
selected plan for each time slice, and (b) is the corresponding probability of
goal attainment.

To maximize the expected utility, we are to maximize the probability of
goal attainment. We can see from (b) that both the simple method and brute
force perform much better than the random method in goal attainment.
Brute force with a budget limit reaches the threshold most quickly, followed
by the simple method. The random method with a budget limit takes the longest
time to reach the threshold, and sometimes never reaches it. The simple
method, compared with the brute force method, is computationally much
more efficient.

Table 10 shows the steps taken by each of the three algorithms in one
experiment. Index is the time slice number, with index 0 being the initial
point. Actions lists the indices of the actions chosen in the time slice.
Observations lists the observation results: 0 means no observation, 1 means
false, and 2 means true.


Figure 7: Experiment 3 Results: (a) Expected utility of the selected plan;
(b) Probability of goal attainment. [Panel titles: Criterion: Expected Utility
with a Budget limit; horizontal axis: time slice.]

Table 10: Budget limit cases: (a) Brute force method; (b) Simple method; (c)
Random selection method

(a)
Index   Actions    Observations   P(G = 2)
0       []                        0.532
1       [1 3 4]                   0.805
2       [1 2 3]                   0.900

(b)
Index   Actions    Observations   P(G = 2)
0       []         [0 0 0 0]      0.532
1       [1 4 2]    [1 2 0 2]      0.8284
2       [1 3]      [1 0 1 0]      0.8476
3       [1 3]      [1 0 2 0]      0.8826

(c)
Index   Actions    Observations   P(G = 2)
0       []         [0 0 0 0]      0.532
1       [2 3]      [0 1 2 0]      0.7505
2       [2 4]      [0 2 0 1]      0.8456
3       [4]        [0 0 0 2]      0.849
4       [4]        [0 0 0 2]      0.8495
5       [2]        [0 1 0 0]      0.8315
6       [1]        [2 0 0 0]      0.8531

5.4 Conclusion

This preliminary study demonstrates the promise of the influence diagram
for EBO-based military modeling and assessment. Specifically, for modeling,
an ID can effectively represent actions, their effects, their relationships, and
their uncertainties. In addition, through belief propagation, action effects
and their uncertainties can be systematically propagated through the model.

For plan assessment, we perform three experiments to evaluate the performance
of three different action selection strategies in terms of their optimality
and efficiency in action selection. Specifically, in the first two experiments,
net expected utility is used as the criterion, and the experiment results show
that the utility-based action selection methods, such as the brute force and
greedy methods, are better than the random algorithm in net expected utility
in each time slice. Their performance with respect to goal attainment, however,
appears to be comparable. This needs to be further investigated. In the third
experiment, the net expected utility is used with a budget limit. As the
utility function for the goal never changes over time, the probability of goal
attainment is maximized with the maximization of the expected utility. From the
experiment 3 results, we can conclude that both brute force with a budget
limit and the simple method perform better than random selection with a budget
limit, helping to achieve the plan goal faster.

While the experiments identify a few issues, the experimental results remain
preliminary, and further studies and analysis will be needed.


6 Appendix: An efficient method for computing the expected utility

In this section, we look for a way to reduce the computational cost
of the action selection algorithm. Regardless of the action selection method
we use (except for the random method), we need the probability of goal
attainment to compute the expected utility or the net expected utility. In
the action selection methods described above, we need to perform inference
in every time slice to obtain the probability of goal attainment. If we can
find a more efficient way to compute the probability of goal attainment, we
can decrease the computational cost of the action selection method.

A general ID for military models is shown in Figure 8. There are n actions
and m subgoals in the model. The structure of the model inside the dotted lines
can be arbitrary, which means that we do not consider the specific number of
nodes or links in that region.


Figure  8:  A  general  ID  for  military  plan  modeling 


As we can see from the general military model above, the only parameter
that changes over time is the probability of the goal in the previous time
slice. As the structure and the other parameters remain the same over time, we
can factor out the constant coefficients for the fixed structure so that they
need to be computed only once.

Specifically, given the structure described in Figure 8, we can derive the
probability of goal attainment as follows:

$$
\begin{aligned}
P(G = 2 \mid A)
&= \sum_{G_0=1}^{2} P(G = 2 \mid G_0, A)\, P(G_0 \mid A) \\
&= \sum_{G_0=1}^{2} P(G = 2 \mid G_0, A)\, P(G_0) \\
&= \sum_{G_0=1}^{2} \sum_{SG} P(G = 2 \mid G_0, SG)\, P(G_0)\, P(SG \mid A) \\
&= x \sum_{SG} P(G = 2 \mid G_0 = 2, SG)\, P(SG \mid A) \\
&\quad + (1 - x) \sum_{SG} P(G = 2 \mid G_0 = 1, SG)\, P(SG \mid A) \qquad (3)
\end{aligned}
$$

where

$$A = \{A_1, \ldots, A_n\}, \quad SG = \{SG_1, \ldots, SG_m\}, \quad x = P(G_0 = 2).$$

Here $A$ represents all the action variables, $SG$ represents all the subgoals
connected to the goal, and $n$ and $m$ are the total number of actions and the
number of subgoals, respectively.

Based on the equations above, we can see that, for a given set of actions, the
probability of goal attainment changes over time only through x, because the
other factors in the equation are constant. We therefore need to compute the
constant factors for each combination of actions only once, when they are
first used, and can reuse them later without recomputation. This leads to
computational savings.
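
The reuse described above can be sketched in Python with a cache keyed by the action combination. In this sketch, `constant_factors` is a hypothetical and potentially expensive function returning the two sums of Eq. (3) for $G_0 = 1$ and $G_0 = 2$; the numbers in the toy stand-in are made up.

```python
from functools import lru_cache

def make_goal_probability(constant_factors):
    """Wrap the two plan-dependent, time-independent sums of Eq. (3) in a cache.

    constant_factors(actions) -> (c1, c2) is a hypothetical callable, where
    c_k = sum over SG of P(G = 2 | G0 = k, SG) * P(SG | A).
    `actions` must be hashable, e.g. a tuple of action indices.
    """
    cached = lru_cache(maxsize=None)(constant_factors)

    def goal_probability(actions, x):
        c1, c2 = cached(actions)           # computed once per action combination
        return x * c2 + (1.0 - x) * c1     # Eq. (3), with x = P(G0 = 2)

    return goal_probability

# Toy stand-in that records how often the expensive part is evaluated.
calls = []
def toy_factors(actions):
    calls.append(actions)
    return 0.25, 0.75                      # assumed values of (c1, c2)

goal_probability = make_goal_probability(toy_factors)
p_first = goal_probability((1, 3), 0.5)    # first time slice
p_later = goal_probability((1, 3), 0.8)    # later slice, same actions, new x
```

After the first call, each further evaluation for the same action combination costs only two multiplications and one addition, regardless of the size of the network inside the dotted region.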
1 Introduction


An important issue in effects-based operation (EBO) is the evaluation of the
effects of a military plan. A military plan consists of a set of selected
actions and their execution timings. For each available action, a planner must
decide whether to include it in the plan. In making these choices, the
planner's objective is to identify a plan that maximizes the probability of
achieving the military objectives while minimizing the associated costs. As the
number of possible plans increases exponentially with the number of actions,
determining the optimal plan via an exhaustive search of the plan space becomes
computationally intractable and practically infeasible. We propose a
factorization procedure that significantly reduces the evaluation time for each
plan by exploiting the computations that are common to all plans. An
experimental study of our method shows that it can often reduce the evaluation
time by orders of magnitude. In addition, the proposed method offers an exact
solution, therefore avoiding the convergence problem that plagues commonly
employed approximate inference methods such as logic sampling.


2  Proposed  Solution 

We model the military planning problem using an Influence Diagram (ID) as shown
in Figure 1. Unlike the causal model in [1], an influence diagram explicitly
includes the action nodes and the value nodes. The primary military actions are
represented by the rectangular nodes $X_i$ ($i \in \{1, \ldots, n\}$), while the
top node $H$ represents the military goal. Each action node $X_i$ is associated
with a value node $U_i$ encoding the cost of performing the action. The cost of
an action includes physical cost, collateral damage, political cost, etc. The
goal node $H$ is also associated with a value node $U$, encoding the value of
goal attainment. In general, the actions do not influence the overall goal
directly; instead, their consequences propagate through a set of intermediate
subgoals/tasks whose realization leads to goal success. In an ID, these
subgoals/tasks are represented by the intermediate circular nodes $Y^i_j$
($j \in \{1, \ldots, m_i\}$), where $Y^i_j$ represents the $j$-th subgoal at
level $i$ (see footnote 1). The links in an ID specify three classes of
probabilistic relations: the relation between the action nodes and the
lowest-level subgoals, the relation among the many subgoals, and the relation
between the highest-level subgoals and the overall goal.

Given such an ID, for each action node $X_i$, a plan $\delta$ prescribes an
action choice $\delta(X_i)$ of $a_i$ or $\neg a_i$, where $a_i$ stands for the
selection of action $X_i$ and $\neg a_i$ represents otherwise. Thus, a plan
$\delta$ can be denoted by $\delta = \{\delta(X_1), \ldots, \delta(X_n)\}$. Let
$g(U_i \mid \delta(X_i))$ and $g(U \mid H)$ be the cost of performing action
$X_i$ and the value of goal attainment, respectively. Then the expected net
utility of executing a plan $\delta$ is

$$E_\delta = E_\delta[U] - \sum_{i=1}^{n} g(U_i \mid \delta(X_i)), \qquad (1)$$

where the first term, $E_\delta[U] = \sum_{H} P(H \mid \delta)\, g(U \mid H)$,
represents the expected utility of the goal and the second term is the combined
cost of executing the selected actions. The optimal plan $\delta^*$ is the one
that maximizes $E_\delta$ over all plans, i.e.,

$$\delta^* = \arg\max_{\delta} E_\delta. \qquad (2)$$

The brute-force approach requires evaluating all $2^n$ plans for $n$ actions to
identify the optimal plan. In this report, we present an approach that can
significantly reduce the evaluation time for each plan.

Footnote 1: For simplicity, we assume that the subgoal nodes are structured in
three levels, although our factorization approach generalizes to arbitrary
levels.

Figure 1: A causal influence diagram modeling the military planning problem,
where squares are action nodes, circles represent subgoals/tasks and the goal,
and diamonds are value nodes.


3 A Factorization Approach to Efficient Plan Evaluation


We propose a factorization procedure that evaluates each plan efficiently by
exploiting the common computations in the network. With this method, a plan can
often be evaluated in orders of magnitude less time than with the normal
evaluation. Specifically, with respect to the ID in Figure 1, $E_\delta[U]$ can
be computed as


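$$
E_\delta[U] = \sum_{H} \sum_{Y^1_{1:m_1}} \sum_{Y^2_{1:m_2}} \sum_{Y^3_{1:m_3}}
\prod_{j=1}^{m_1} P(Y^1_j \mid \delta)
\prod_{j=1}^{m_2} P(Y^2_j \mid Y^1_{1:m_1})
\prod_{j=1}^{m_3} P(Y^3_j \mid Y^2_{1:m_2})\,
P(H \mid Y^3_{1:m_3})\, g(U \mid H),
$$

where $Y^i_{1:m_i}$ denotes $\{Y^i_1, \ldots, Y^i_{m_i}\}$ for the nodes at the $i$-th level.
Exchanging the summing order among variables, we have

$$
E_\delta[U] = \sum_{Y^1_{1:m_1}}
\Big[ \sum_{H} \sum_{Y^2_{1:m_2}} \sum_{Y^3_{1:m_3}}
\prod_{j=1}^{m_2} P(Y^2_j \mid Y^1_{1:m_1})
\prod_{j=1}^{m_3} P(Y^3_j \mid Y^2_{1:m_2})\,
P(H \mid Y^3_{1:m_3})\, g(U \mid H) \Big]
\Big[ \prod_{j=1}^{m_1} P(Y^1_j \mid \delta) \Big].
$$

Let us define two functions $f_U$ and $f_\delta$ corresponding to the terms within the two brackets:

$$
f_U = \sum_{H} \sum_{Y^2_{1:m_2}} \sum_{Y^3_{1:m_3}}
\prod_{j=1}^{m_2} P(Y^2_j \mid Y^1_{1:m_1})
\prod_{j=1}^{m_3} P(Y^3_j \mid Y^2_{1:m_2})\,
P(H \mid Y^3_{1:m_3})\, g(U \mid H) \qquad (3)
$$

$$
f_\delta = \prod_{j=1}^{m_1} P(Y^1_j \mid \delta) \qquad (4)
$$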
With these two functions, we rewrite $E_\delta[U]$ as

$$E_\delta[U] = \sum_{Y^1_{1:m_1}} f_U\, f_\delta. \qquad (5)$$

It can be seen that the function $f_U$ is independent of the plan $\delta$,
while the function $f_\delta$ depends on the plan $\delta$. In other words,
$f_U$ needs to be computed only once and can then be reused across all plans.
These computations can be factored out. The factorized plan evaluation can be
implemented as follows:


1. Pre-compute the quantities $f_U$ once; they are shared by all plans

2. For each plan $\delta$

compute $f_\delta$ for each assignment of $Y^1_{1:m_1}$
compute $E_\delta[U]$ by Equation (5)

3. Return the plan $\delta^*$ that maximizes $E_\delta[U]$


Table  1:  The  factorization  approach  to  ID  evaluation 
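
The steps in Table 1 can be sketched in Python as follows. This is only an illustrative toy, not the Matlab code used in the experiments: the network is collapsed to a single subgoal level feeding a binary goal node, all probabilities are random, and the names `build_toy_model`, `f_u`, `f_delta`, `eval_bf`, and `eval_cs` are hypothetical. It shows the plan-independent factor being tabulated once in `eval_cs`, whereas `eval_bf` recomputes it for every plan.

```python
import itertools
import random


def build_toy_model(n_actions, m1, seed=0):
    """Random toy model: plans -> level-1 subgoals Y1 -> goal H -> utility."""
    rng = random.Random(seed)
    plans = list(itertools.product([0, 1], repeat=n_actions))
    y1_space = list(itertools.product([0, 1], repeat=m1))
    # P(Y1_i = 1 | plan) for every plan and every level-1 subgoal i
    p_y1 = {plan: [rng.random() for _ in range(m1)] for plan in plans}
    # P(H = 1 | y1); stands in for everything downstream of level 1
    p_h = {y1: rng.random() for y1 in y1_space}
    utility = {0: 0.0, 1: 100.0}  # g(U | H)
    return plans, y1_space, p_y1, p_h, utility


def f_u(y1, p_h, utility):
    """Plan-independent factor: sum over H of P(H | y1) * g(U | H)."""
    return p_h[y1] * utility[1] + (1.0 - p_h[y1]) * utility[0]


def f_delta(y1, probs):
    """Plan-dependent factor: product over i of P(Y1_i = y1_i | plan)."""
    value = 1.0
    for y, p in zip(y1, probs):
        value *= p if y == 1 else 1.0 - p
    return value


def eval_bf(plans, y1_space, p_y1, p_h, utility):
    """Brute force: recompute the downstream factor for every plan."""
    return {
        plan: sum(f_u(y1, p_h, utility) * f_delta(y1, p_y1[plan]) for y1 in y1_space)
        for plan in plans
    }


def eval_cs(plans, y1_space, p_y1, p_h, utility):
    """Computation sharing: tabulate f_U once, then reuse it for all plans."""
    f_u_table = {y1: f_u(y1, p_h, utility) for y1 in y1_space}
    return {
        plan: sum(f_u_table[y1] * f_delta(y1, p_y1[plan]) for y1 in y1_space)
        for plan in plans
    }


model = build_toy_model(n_actions=3, m1=4)
scores = eval_cs(*model)
best_plan = max(scores, key=scores.get)
```

In this toy the shared factor is cheap, so the saving is small; in a layered ID the tabulated factor hides the sums over all deeper levels, which is where the reported gain would come from.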


4  Experiments 

We have conducted experiments to evaluate the performance of the factorization approach. In our experiments, we use the ID in Figure 1 as the test example. The CPTs are randomly generated. The value functions for value nodes are manually specified. The code is written in Matlab V6.5 and runs on a laptop with a 2.0 GHz CPU under Windows XP. In our experiments, we compare the factorization approach against the generic brute-force approach. For convenience, we refer to them respectively as evalCS (named after Computation Sharing) and evalBF (named after Brute-Forced).

To see how the performance of the algorithms varies with the number of action nodes, we fix the number of subgoal nodes at each level at four and vary the number of action nodes. Thus the static ID with n action nodes additionally has 10 random nodes and n+1 value nodes. We ran evalBF and evalCS for seven problems with n = 3, 5, ..., 13. The timing data are presented in the left chart of Figure 2. The chart gives the total CPU seconds that the algorithms took for each of the problems. Note that the vertical axis is drawn in log scale. The solid (dashed) curve is for evalCS (evalBF). It can be seen that evalCS is considerably more efficient than evalBF. For instance, in our collected data, for n = 9, to evaluate 512 plans, evalCS took 3.51 seconds while evalBF took 646.74 seconds; for n = 13, to evaluate 8192 plans, evalCS used 46.32 seconds while evalBF used 9284.05 seconds.
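
As a rough check on these figures (simple arithmetic on the two timings quoted above, not an additional measurement), the speed-up of evalCS over evalBF is

$$\frac{646.74}{3.51} \approx 184 \quad (n = 9), \qquad \frac{9284.05}{46.32} \approx 200 \quad (n = 13),$$

i.e., roughly two orders of magnitude for these two problem sizes.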

Figure 2: Performance comparison of evalBF and evalCS


5 Conclusions and Future Work

In this report, we proposed a factorization approach to significantly reduce the time for the evaluation of a military plan. Our approach identifies the common computations used by all plans and factors these computations out so that they need only be computed once. This has resulted in a significant computational saving, often by several orders of magnitude. Experimental results validate our method.

Despite these successes, there remains work to do for the factorization procedure. So far, our method has only been applied to static IDs. Further investigations, theoretical developments, and experimental validations are necessary to extend our method to handle dynamic networks, networks with evidence, and other complex network structures, including highly non-layered networks and those with utilities attached to the intermediate nodes. In addition, we need to develop the theory for analytically determining the upper and lower bound performance for a given network, without explicitly evaluating any plan. This would allow the planner to commence plan evaluation to identify the best plan only if the upper bound and lower bound performance exceed his/her expectation.

Abstract


Graphical models (GMs) such as Bayesian Networks (BNs) or Influence Diagrams (IDs) are increasingly being applied in many different applications. One bottleneck in using GMs is that learning the GM parameters often requires a relatively large amount of training data. However, in real life and for many applications, training data is often incomplete or sparse, which can cause low learning accuracy. Incorporating domain knowledge can help alleviate this problem. Instead of using quantitative prior knowledge, as most existing methods do, this paper introduces a novel learning method based on systematically combining the training data with some qualitative knowledge.

To validate our method, we compare it with the Maximum Likelihood (ML) estimation method under sparse data and with the Expectation Maximization (EM) algorithm under incomplete data. The experimental results show that our method improves the parameter learning accuracy significantly compared with both the ML and EM algorithms.


1  Introduction 


Among all the issues of graphical models, parameter learning is one of the main challenges. Parameter learning is the task of estimating the entries of the conditional probability distributions (CPDs) given the structure of a model. Many learning techniques rely heavily on training data [7]. Ideally, with sufficient data, it is possible to learn the parameters by standard statistical analysis such as maximum likelihood (ML) estimation. In many real-world cases, however, the data are either incomplete or sparse, which can cause inaccurate parameter estimation. Data incompleteness means that data are missing for some parameters, while data sparseness means that the amount of training data is limited.

When data are incomplete, the Expectation-Maximization (EM) algorithm [3] is often used. Most EM-based methods work under the assumption that data are missing at random (MAR), which means the missing values can be estimated from the observed ones in some way. However, when data are missing completely at random (MCAR), e.g., data of hidden nodes, the learned parameters could be far from the ground truth. The reason is that the missing data do not even depend on the observed ones, and there is no way to estimate the missing data only from the observed ones.

In this paper, we propose a framework to solve the parameter learning problem by combining quantitative data and domain knowledge in the form of qualitative constraints. Two kinds of qualitative constraints are defined: range constraints, which are applied to individual parameters, and relationship constraints, which are applied to pairs of parameters. For sparse but complete data, we solve the learning task by reformulating the problem as a constrained ML (CML) problem. For incomplete data, we introduce the constrained EM (CEM) algorithm by adding constraints to the M step, and solve the learning problem iteratively. In addition, we provide closed form solutions for both CML and CEM.

2 Related Work


We have already discussed that one of the shortcomings of EM algorithms is that they can easily be trapped in a local maximum when data are MCAR. There are many different methods to help EM escape from local maxima, such as the information-bottleneck EM algorithm [4], the data perturbation method [5], and the AI&M procedure [8]. These methods focus on improving the machine learning techniques, but ignore useful domain knowledge.

Domain knowledge can be classified as quantitative or qualitative knowledge, which describe the explicit quantification of parameters and approximate characterizations of parameters, respectively. Both kinds of domain knowledge are useful for parameter learning. While quantitative knowledge has been widely used in the form of prior probability distributions, qualitative constraints have not yet been fully exploited in parameter learning.

Wittig et al. [12] present a method to integrate qualitative constraints into two learning algorithms, APN [9] and EM, by adding violation functions as a penalty term to the log likelihood function. They show that domain knowledge in the form of constraints can improve learning accuracy. However, this penalty-based method is not guaranteed to find the global maximum. Besides, the weights for the penalty functions often need to be manually tuned, depending on the application. Altendorf et al. [1] describe a method to incorporate monotonicity constraints into the learning algorithm. It is based on the assumption that the values of the variables can be totally ordered. Additionally, it also uses penalty functions, and thus suffers from the same problem as [12]. Feelders and Van der Gaag [6] incorporate some simple inequality constraints in the learning process. They assume that all the variables are binary. The constraints used in the above methods [1, 12, 6] are restrictive, as each constraint has to involve all parameters in a conditional probability table (CPT).

Campos and Cozman [2] formulate the learning problem as a constrained optimization problem. However, they do not provide a specific method to solve the optimization problem. Niculescu et al. [11] also solve the learning problem by optimization techniques. They derive closed form solutions with ML estimation for two kinds of constraints: inequalities between sums of parameters and upper bounds on sums of parameters within a CPT. Their method has two main limitations. First, they assume that each parameter is involved in only one constraint, so that there is no overlap between the parameters of different constraints. Second, their method cannot handle constraints involving different CPTs. We improve their method by deriving the closed form solution for range constraints, which contain both upper bound and lower bound constraints for the same parameters. In addition, the relationship constraints defined in our paper can be either within or between CPTs.


3  Problem  Definition  and  Approach 

3.1  Basic  Parameter  Learning  Theory 

We focus on parameter learning in a Bayesian network with all discrete nodes, where the structure is known in advance. The method can be extended to other graphical models, including IDs. The notation is defined as follows. Assume a BN with $n$ nodes; $\theta$ is the entire vector of parameters, and $\theta_{ijk}$ denotes one of the parameters, $\theta_{ijk} = p(x_i^k \mid pa_i^j)$, where $i$ ($i = 1, \ldots, n$) ranges over all the variables in the BN, $j$ ($j = 1, \ldots, q_i$) ranges over all the possible parent configurations of node (variable) $X_i$, and $k$ ($k = 1, \ldots, r_i$) ranges over all the possible states of $X_i$. Therefore, $x_i^k$ represents the $k$th state of node $X_i$, and $pa_i^j$ is the $j$th parent configuration of node $X_i$.

Given a dataset $D = \{D_1, \ldots, D_N\}$, which consists of samples of the BN nodes, the goal of parameter learning is to find the most probable values $\hat{\theta}$ for $\theta$ that best explain the dataset $D$, which is usually quantified by the log likelihood function $\log(p(D \mid \theta))$, denoted as $L_D(\theta)$. Assuming that the examples are drawn independently from the underlying distribution, and based on the conditional independence assumptions in BNs, we have the log likelihood function in Eq. (1), where $n_{ijk}$ is the count of the cases in which node $i$ is in state $k$ while its parent nodes are in configuration $j$.


$$L_D(\theta) = \log \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{n_{ijk}}$$  (1)

If the dataset $D$ is complete, the ML estimation method can be described as a constrained optimization problem, i.e., maximize the log likelihood (Eq. (2)) subject to the equality constraints in Eq. (3), one for each node $i$ and parent configuration $j$.

$$\max_{\theta} \; L_D(\theta)$$  (2)

$$\text{s.t.} \quad g_{ij}(\theta) = \sum_{k=1}^{r_i} \theta_{ijk} - 1 = 0$$  (3)

where $g_{ij}$ imposes the constraint that the parameters of node $i$ under parent configuration $j$ sum to 1 over all its states, $1 \le i \le n$ and $1 \le j \le q_i$.
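
For complete data, the problem in Eqs. (2)-(3) has the standard solution $\hat{\theta}_{ijk} = n_{ijk} / \sum_{k} n_{ijk}$. The following Python sketch illustrates it together with the log likelihood of Eq. (1). The nested-list layout `counts[i][j][k]`, the function names, and the uniform fallback for parent configurations without data are assumptions made for illustration only.

```python
import math


def ml_estimate(counts):
    """counts[i][j][k] = n_ijk; returns theta[i][j][k] = n_ijk / sum_k n_ijk."""
    theta = []
    for node_counts in counts:
        node_theta = []
        for row in node_counts:
            total = sum(row)
            if total == 0:
                # Assumption: no data for this parent configuration -> uniform.
                node_theta.append([1.0 / len(row)] * len(row))
            else:
                node_theta.append([n / total for n in row])
        theta.append(node_theta)
    return theta


def log_likelihood(counts, theta):
    """L_D(theta) = sum over i, j, k of n_ijk * log(theta_ijk), skipping zero counts."""
    return sum(
        n * math.log(t)
        for node_counts, node_theta in zip(counts, theta)
        for row_n, row_t in zip(node_counts, node_theta)
        for n, t in zip(row_n, row_t)
        if n > 0
    )


counts = [
    [[6, 4]],          # node 1: no parents, so a single parent configuration
    [[3, 3], [1, 3]],  # node 2: two parent configurations
]
theta_hat = ml_estimate(counts)
```

With sparse data the counts in a row can be very small, so these ratios become unreliable; this is the situation the qualitative constraints are meant to address.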

If the dataset $D$ is incomplete, ML estimation cannot be applied directly. A common method is the standard EM algorithm [3], which starts from some initial point and then iterates the E step and the M step to reach a local maximum of the likelihood function. For discrete nodes in particular, the E step computes the expected counts for all parameters, and the M step estimates the parameters by maximizing the log likelihood function given the counts from the E step. The EM algorithm is guaranteed to converge to a local maximum. However, depending on the initialization, it may converge to different local maxima. When a large amount of data is missing, which means there are many local maxima, the EM algorithm can get stuck in a local maximum far away from the global one.

3.2  Qualitative  Constraints 

We  introduce  two  kinds  of  qualitative  constraints,  which  can  be  easily  specified  by  domain  experts. 
They  are  range  and  relationship  constraints. 

A range constraint defines the upper bound and lower bound of a parameter. Assuming $\alpha_{ijk}$ and $\beta_{ijk}$ are the upper bound and lower bound for parameter $\theta_{ijk}$, the range constraints can be defined as follows:

$$\beta_{ijk} \le \theta_{ijk} \le \alpha_{ijk}$$  (4)

where $0 \le \alpha_{ijk} \le 1$ and $0 \le \beta_{ijk} \le 1$.

A relationship constraint defines the relative relationship between a pair of parameters. If the two parameters in a relationship constraint share the same node index $i$ and parent configuration $j$, the constraint is called an intra-relationship constraint, which can be represented as follows:

$$\theta_{ijk} \le \theta_{ijk'} \quad \text{where } k \ne k'$$  (5)

If the two parameters in a relationship constraint do not satisfy the requirement of an intra-relationship constraint, the constraint is called an inter-relationship constraint. It can be described as follows:

$$\theta_{ijk} \le \theta_{i'j'k'} \quad \text{where } i \ne i' \text{ or } j \ne j'$$  (6)


3.3  Overview  of  Our  Approach 

We aim to solve the learning problem by reformulating it as a constrained optimization problem, i.e.,

$$\max_{\theta} \; L_D(\theta)$$  (7)

$$\text{s.t.} \quad g_{ij}(\theta) = \sum_{k=1}^{r_i} \theta_{ijk} - 1 = 0, \quad 1 \le i \le n \text{ and } 1 \le j \le q_i$$
$$h_p(\theta) \le 0, \quad 1 \le p \le S$$

where $h_p(\theta) \le 0$ denotes the inequality constraints, and $S$ is the total number of inequality constraints. Using the Lagrange multipliers $\lambda_{ij}$ and $\mu_p$, the objective function to be maximized can be combined with the constraints, producing the following augmented objective function

$$f(\theta) = L_D(\theta) - \sum_{i=1}^{n} \sum_{j=1}^{q_i} \lambda_{ij} \, g_{ij}(\theta) - \sum_{p=1}^{S} \mu_p \, h_p(\theta)$$  (8)

Given Eq. (8), for sparse but complete data, we can directly apply the CML method by maximizing Eq. (8) to estimate the parameters. For incomplete data, we can replace the M step of the EM algorithm with the solution to Eq. (8), and iteratively obtain the parameter estimates. In the following section, we introduce our solution to Eq. (8).

4 Parameter Learning With Qualitative Constraints


In this section, we derive the closed form solutions for maximizing Eq. (8) under different types of constraints. Because the log likelihood function is decomposable, we can deal with small optimization subproblems on independent parameter sets separately instead of dealing with all parameters simultaneously. For this, we define two kinds of parameter sets: one is the baseline set, which contains the parameters with the same node and the same parent configuration; the other is the combined set, which contains several baseline sets. We first separate the parameters into baseline sets, and then, if there is a constraint on parameters from different baseline sets, we combine those baseline sets into one new combined set. This process continues until no constraint involves parameters from different sets. After this decomposition, we solve the constrained optimization subproblems independently, set by set.

Specifically, let $Q$ denote a parameter set. Since parameters from one baseline set share the same node $i$ and the same parent configuration $j$, we use $(i, j)$ to denote the index of a baseline set. A baseline set can be denoted as $Q = \{(i, j)\}$, while a combined set, which consists of several baseline sets, can be denoted as $Q = \{(i, j), (i', j'), \ldots\}$.
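
The merging of baseline sets into combined sets can be sketched with a union-find structure, as below. This is one possible way to implement the grouping described above, not necessarily the authors' procedure; the function name `build_parameter_sets` and the encoding of a constraint as a pair of parameter indices `((i, j, k), (i2, j2, k2))` are assumptions made for illustration.

```python
def build_parameter_sets(baseline_sets, constraints):
    """Merge baseline sets that are linked by a constraint into combined sets.

    baseline_sets: iterable of (i, j) indices.
    constraints: iterable of ((i, j, k), (i2, j2, k2)) parameter pairs.
    """
    parent = {s: s for s in baseline_sets}

    def find(s):
        # Follow parent links to the representative, shortening the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for (i, j, _), (i2, j2, _) in constraints:
        root_a, root_b = find((i, j)), find((i2, j2))
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for s in parent:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


baseline = [(1, 1), (2, 1), (2, 2), (3, 1)]
constraints = [
    ((1, 1, 1), (1, 1, 2)),  # intra-relationship: stays inside baseline set (1, 1)
    ((2, 1, 1), (3, 1, 2)),  # inter-relationship: links (2, 1) and (3, 1)
]
sets = build_parameter_sets(baseline, constraints)
```

Each returned group is one parameter set $Q$ that can be optimized independently of the others.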

The parameter learning problem can be decomposed into subproblems, one for each set of parameters. A subproblem can be formulated as follows:

$$\max_{\theta} \; L_D(\theta) = \log \prod_{(i,j) \in Q} \prod_{k=1}^{r_i} \theta_{ijk}^{n_{ijk}}$$
$$\text{s.t.} \quad g_{ij}(\theta) = \sum_{k=1}^{r_i} \theta_{ijk} - 1 = 0 \quad \text{for } (i,j) \in Q$$
$$h_p(\theta) \le 0 \quad \text{for } 1 \le p \le S_Q$$  (9)

where $g_{ij}$ represents an equality constraint, $h_p$ represents an inequality constraint, and $S_Q$ is the number of inequality constraints in set $Q$.


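Since the log likelihood function is concave and the qualitative constraints are linear, the Karush-Kuhn-Tucker (KKT) conditions [10] are sufficient to determine the solution to Eq. (9). The KKT conditions for the problem described in Eq. (9) are:

$$\nabla_{\theta} \Big[ L_D(\theta) - \sum_{(i,j) \in Q} \lambda_{ij} \, g_{ij}(\theta) - \sum_{p=1}^{S_Q} \mu_p \, h_p(\theta) \Big] = 0$$  (10)

$$g_{ij}(\theta) = 0 \quad \text{for } (i,j) \in Q$$
$$h_p(\theta) \le 0 \quad \text{for } 1 \le p \le S_Q$$
$$\mu_p \ge 0 \quad \text{for } 1 \le p \le S_Q$$
$$\mu_p \, h_p(\theta) = 0 \quad \text{for } 1 \le p \le S_Q$$  (11)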

In optimization, an inequality constraint $h_p \le 0$ is active if $h_p = 0$, and inactive if $h_p < 0$. Based on this definition, we derive closed form solutions for each type of constraint.


4.1  Range  Constraints 

Since range constraints (Eq. (4)) are applied to individual parameters, we can solve the subproblems with range constraints within baseline sets. There are two constraints for each parameter $\theta_{ijk}$ in a baseline set $Q = \{(i,j)\}$: $h_k^{\alpha}(\theta) = \theta_{ijk} - \alpha_{ijk} \le 0$ (upper bound constraint), and $h_k^{\beta}(\theta) = \beta_{ijk} - \theta_{ijk} \le 0$ (lower bound constraint).

As the objective function is concave and the range constraints are linear, the maximum either lies inside the feasible region defined by all constraints, when no constraint is active, or on the boundary defined by the active constraints, when some of the constraints are active. Assuming $K_Q^{\beta}$ and $K_Q^{\alpha}$ are the sets of active constraints for the lower bounds and upper bounds of the parameters in $Q$ respectively, and $K_Q = K_Q^{\beta} \cup K_Q^{\alpha}$ represents the set of all active constraints of the parameters in $Q$, the closed form solution for $\theta_{ijk}$ is as follows:

$$\hat{\theta}_{ijk} = \begin{cases} \beta_{ijk} & \text{if } k \in K_Q^{\beta} \\ \alpha_{ijk} & \text{if } k \in K_Q^{\alpha} \\ \Big(1 - \sum_{k' \in K_Q^{\beta}} \beta_{ijk'} - \sum_{k' \in K_Q^{\alpha}} \alpha_{ijk'}\Big) \dfrac{n_{ijk}}{\sum_{k' \notin K_Q} n_{ijk'}} & \text{otherwise} \end{cases}$$  (12)

Table 1: Search algorithm for finding active range constraints

Step 1: Check the consistency of the range constraints: $0 \le \alpha_{ijk} \le 1$, $0 \le \beta_{ijk} \le 1$, $\alpha_{ijk} \ge \beta_{ijk}$, $\sum_{k=1}^{r_i} \beta_{ijk} \le 1$, and $\sum_{k=1}^{r_i} \alpha_{ijk} \ge 1$ for $1 \le k \le r_i$. If satisfied, continue; else change constraints.

Step 2: If $\sum_{k=1}^{r_i} \alpha_{ijk} = 1$, all the upper bound constraints should be active; else if $\sum_{k=1}^{r_i} \beta_{ijk} = 1$, all the lower bound constraints should be active. Else, continue.

Step 3: Perform the ML estimation of parameters without constraints. Check the constraints with the estimated parameters $\theta^*_{ijk} = n_{ijk} / \sum_{k=1}^{r_i} n_{ijk}$. If no constraint is violated, then there is no active range constraint. Else, continue.

Step 4: List all possible combinations of active constraints, and remove a combination if it contains more than $r_i - 1$ active constraints or $\sum_{k \in K_Q^{\beta}} \beta_{ijk} + \sum_{k \in K_Q^{\alpha}} \alpha_{ijk} \ge 1$.

Step 5: For each of the remaining combinations, compute $\lambda_{ij}$, until finding a $\lambda_{ij}$ satisfying the criteria in Eq. (13).


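The derivation is as follows. From the first equation of the KKT conditions (Eq. (10)), we obtain $\theta_{ijk} = \dfrac{n_{ijk}}{\lambda_{ij} + \mu_k^{\alpha} - \mu_k^{\beta}}$. Because $\theta_{ijk}$ cannot sit at its upper bound $\alpha_{ijk}$ and its lower bound $\beta_{ijk}$ at the same time, at most one of the upper bound constraint $h_k^{\alpha}$ and the lower bound constraint $h_k^{\beta}$ for a parameter $\theta_{ijk}$ can be active at a time. Based on whether there is an active constraint for $\theta_{ijk}$, two cases are considered.

• Case 1: If one of the upper bound and lower bound constraints is active, then $\theta_{ijk} = \alpha_{ijk}$ when $h_k^{\alpha}(\theta) = 0$, and $\theta_{ijk} = \beta_{ijk}$ when $h_k^{\beta}(\theta) = 0$.

• Case 2: If the range constraints are not active, then $h_k^{\alpha}(\theta) < 0$, $h_k^{\beta}(\theta) < 0$ and $\mu_k^{\alpha} = \mu_k^{\beta} = 0$. Hence $\theta_{ijk} = \dfrac{n_{ijk}}{\lambda_{ij}}$. Summing up over all parameters whose constraints are not active, we get

$$1 - \sum_{k \in K_Q^{\beta}} \beta_{ijk} - \sum_{k \in K_Q^{\alpha}} \alpha_{ijk} = \frac{\sum_{k \notin K_Q} n_{ijk}}{\lambda_{ij}}.$$

Thus, we can obtain

$$\lambda_{ij} = \frac{\sum_{k \notin K_Q} n_{ijk}}{1 - \sum_{k \in K_Q^{\beta}} \beta_{ijk} - \sum_{k \in K_Q^{\alpha}} \alpha_{ijk}}, \quad \text{and} \quad \theta_{ijk} = \Big(1 - \sum_{k' \in K_Q^{\beta}} \beta_{ijk'} - \sum_{k' \in K_Q^{\alpha}} \alpha_{ijk'}\Big) \frac{n_{ijk}}{\sum_{k' \notin K_Q} n_{ijk'}}$$

as shown in Eq. (12).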

In this way, we derive the closed form solution for range constraints. To obtain the solution in Eq. (12), we need to identify the active constraints. Table 1 summarizes the algorithm to find active range constraints. The main idea of this algorithm is to search for the active constraints using the criteria in Eq. (13). Due to the page limit, we do not provide the proof for this equation.


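$$\begin{cases} \lambda_{ij} \le \dfrac{n_{ijk}}{\alpha_{ijk}} & k \in K_Q^{\alpha} \\ \lambda_{ij} \ge \dfrac{n_{ijk}}{\beta_{ijk}} & k \in K_Q^{\beta} \\ \dfrac{n_{ijk}}{\alpha_{ijk}} \le \lambda_{ij} \le \dfrac{n_{ijk}}{\beta_{ijk}} & \text{otherwise} \end{cases}$$  (13)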
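
The search for active range constraints can be sketched in Python for a single baseline set as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it enumerates every assignment of "inactive / lower bound active / upper bound active" exhaustively (exponential in the number of states $r_i$), reads the acceptance criteria as $\lambda_{ij} \le n_{ijk}/\alpha_{ijk}$ for active upper bounds, $\lambda_{ij} \ge n_{ijk}/\beta_{ijk}$ for active lower bounds, and $n_{ijk}/\alpha_{ijk} \le \lambda_{ij} \le n_{ijk}/\beta_{ijk}$ otherwise, and treats division by a zero bound as $+\infty$. The names `range_constrained_ml`, `_ratio`, and `_satisfies` are hypothetical.

```python
from itertools import product


def _ratio(n, bound):
    """n / bound, treating division by a zero bound as +infinity."""
    return n / bound if bound > 0 else float("inf")


def _satisfies(lam, n, lo, hi, status, tol):
    """Check the multiplier criteria for one parameter."""
    upper_ratio, lower_ratio = _ratio(n, hi), _ratio(n, lo)
    if status == 2:  # upper bound active
        return lam <= upper_ratio + tol
    if status == 1:  # lower bound active
        return lam >= lower_ratio - tol
    return upper_ratio - tol <= lam <= lower_ratio + tol


def range_constrained_ml(counts, lower, upper, tol=1e-9):
    """Constrained ML for one baseline set: lower[k] <= theta[k] <= upper[k]."""
    r, total = len(counts), sum(counts)
    # Step 1: consistency of the bounds.
    if total <= 0 or not all(0 <= b <= a <= 1 for b, a in zip(lower, upper)):
        raise ValueError("need positive counts and 0 <= lower <= upper <= 1")
    if sum(lower) > 1 + tol or sum(upper) < 1 - tol:
        raise ValueError("bounds admit no probability distribution")
    # Step 2: the bounds alone pin down the solution.
    if abs(sum(upper) - 1) <= tol:
        return list(upper)
    if abs(sum(lower) - 1) <= tol:
        return list(lower)
    # Step 3: unconstrained ML estimate.
    theta = [n / total for n in counts]
    if all(b - tol <= t <= a + tol for t, b, a in zip(theta, lower, upper)):
        return theta
    # Steps 4-5: status 0 = inactive, 1 = lower bound active, 2 = upper bound active.
    for status in product((0, 1, 2), repeat=r):
        free = [k for k in range(r) if status[k] == 0]
        fixed = sum(lower[k] if status[k] == 1 else upper[k]
                    for k in range(r) if status[k] != 0)
        free_count = sum(counts[k] for k in free)
        if not free or fixed >= 1 or free_count == 0:
            continue
        lam = free_count / (1 - fixed)  # candidate multiplier lambda_ij
        if all(_satisfies(lam, counts[k], lower[k], upper[k], status[k], tol)
               for k in range(r)):
            return [lower[k] if status[k] == 1 else
                    upper[k] if status[k] == 2 else counts[k] / lam
                    for k in range(r)]
    raise RuntimeError("no consistent set of active constraints found")


theta_upper = range_constrained_ml([8, 1, 1], [0, 0, 0], [0.6, 1, 1])
theta_lower = range_constrained_ml([1, 9], [0.3, 0], [1, 1])
```

In the first call the unconstrained estimate 0.8 for the first state violates its upper bound of 0.6, so that parameter is clamped and the remaining mass is shared in proportion to the other counts.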


4.2  Intra-Relationship  Constraints 

An intra-relationship constraint defines the relationship between two parameters within one baseline set. Assume the parameters within one baseline set $Q = \{(i,j)\}$ are $\theta_{ij1}, \ldots, \theta_{ijr_i}$, which can be partitioned into $Q = A \cup B \cup C$, where $A = \{a_p \mid p = 1, 2, \ldots, S_Q\}$ and $B = \{b_p \mid p = 1, 2, \ldots, S_Q\}$, such that $h_p(\theta) = \theta_{ija_p} - \theta_{ijb_p} \le 0$ for $1 \le p \le S_Q$, and $C$ is the set of parameters without intra-relationship constraints. The closed form solution for parameter $\theta_{ijk}$ is then as follows:

$$\hat{\theta}_{ijk} = \begin{cases} \dfrac{n_{ija_p} + n_{ijb_p}}{2 N_{ij}} & \text{if } k = a_p \text{ or } b_p \text{ and } n_{ija_p} > n_{ijb_p} \\ \dfrac{n_{ijk}}{N_{ij}} & \text{otherwise} \end{cases}$$  (14)

where $N_{ij} = \sum_{k=1}^{r_i} n_{ijk}$. The derivation is similar to the one in Niculescu et al. [11].
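
A minimal Python sketch of this closed form for one baseline set is given below. It assumes, as the partition into $A$, $B$ and $C$ suggests, that every state index appears in at most one constraint pair; the function name `intra_constrained_ml` and the encoding of a constraint $\theta_{ija} \le \theta_{ijb}$ as a tuple `(a, b)` of zero-based state indices are illustrative choices.

```python
def intra_constrained_ml(counts, pairs):
    """Closed form for constraints theta[a] <= theta[b], one per (a, b) in pairs.

    Assumes every state index appears in at most one pair.
    """
    total = sum(counts)
    theta = [n / total for n in counts]  # unconstrained ML estimate
    for a, b in pairs:
        if counts[a] > counts[b]:
            # The ML estimate violates theta[a] <= theta[b]: the constraint is
            # active and both parameters share their pooled probability mass.
            theta[a] = theta[b] = (counts[a] + counts[b]) / (2 * total)
    return theta


theta_active = intra_constrained_ml([6, 2, 2], [(0, 1)])
theta_inactive = intra_constrained_ml([2, 6, 2], [(0, 1)])
```

When the counts already respect the constraint, as in the second call, the result coincides with the unconstrained ML estimate.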


4.3  Inter-Relationship  Constraints 


An inter-relationship constraint defines a constraint on two parameters $\theta_{i'j'a}$ and $\theta_{i''j''b}$ from different baseline sets $Q_A$ and $Q_B$; thus the subproblem for parameters with an inter-relationship constraint is defined on a combined parameter set $Q = Q_A \cup Q_B$, where baseline set $Q_A = \{(i', j')\}$ and baseline set $Q_B = \{(i'', j'')\}$, such that $h(\theta) = \theta_{i'j'a} - \theta_{i''j''b} \le 0$. Let $N_A = \sum_{k=1}^{r_{i'}} n_{i'j'k}$, $N_B = \sum_{k=1}^{r_{i''}} n_{i''j''k}$, $n_a = n_{i'j'a}$, and $n_b = n_{i''j''b}$. The closed form solution for parameters with an inter-relationship constraint is as follows. If $n_a N_B - N_A n_b > 0$,

$$\hat{\theta}_{ijk} = \begin{cases} \dfrac{n_a + n_b}{N_A + N_B} & ijk = i'j'a \text{ or } i''j''b \\ \Big(1 - \dfrac{n_a + n_b}{N_A + N_B}\Big) \dfrac{n_{ijk}}{N_A - n_a} & (i,j) \in Q_A \text{ and } k \ne a \\ \Big(1 - \dfrac{n_a + n_b}{N_A + N_B}\Big) \dfrac{n_{ijk}}{N_B - n_b} & (i,j) \in Q_B \text{ and } k \ne b \end{cases}$$  (15)

Else,

$$\hat{\theta}_{ijk} = \begin{cases} \dfrac{n_{ijk}}{N_A} & (i,j) \in Q_A \\ \dfrac{n_{ijk}}{N_B} & (i,j) \in Q_B \end{cases}$$  (16)


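The brief derivation of the solution is as follows. The KKT conditions are:

$$\nabla_{\theta} \big[ L_D(\theta) - \lambda_A \, g_A(\theta) - \lambda_B \, g_B(\theta) - \mu \, h(\theta) \big] = 0$$  (17)

$$g_A(\theta) = 0, \quad g_B(\theta) = 0$$
$$h(\theta) \le 0, \quad \mu \ge 0, \quad \mu \, h(\theta) = 0$$

From the first equation of the KKT conditions (Eq. (17)), we can obtain:

$$\theta_{ijk} = \begin{cases} \dfrac{n_{ijk}}{\lambda_A + \mu} & ijk = i'j'a \\ \dfrac{n_{ijk}}{\lambda_B - \mu} & ijk = i''j''b \\ \dfrac{n_{ijk}}{\lambda_A} & (i,j) \in Q_A \text{ and } k \ne a \\ \dfrac{n_{ijk}}{\lambda_B} & (i,j) \in Q_B \text{ and } k \ne b \end{cases}$$  (18), (19)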


Two cases are considered, depending on whether the inter-relationship constraint is active or not:

• Case 1: $h(\theta) = 0$ and $\mu > 0$

We can solve for $\lambda_A$, $\lambda_B$, and $\mu$ with the following equations:

$$\frac{n_a}{\lambda_A + \mu} = \frac{n_b}{\lambda_B - \mu} = \frac{n_a + n_b}{\lambda_A + \lambda_B}$$
$$\frac{n_a}{\lambda_A + \mu} + \frac{N_A - n_a}{\lambda_A} = 1$$
$$\frac{n_b}{\lambda_B - \mu} + \frac{N_B - n_b}{\lambda_B} = 1$$  (20)

The first equation is $h(\theta) = 0$, and the second and the third are from $g_A(\theta) = 0$ and $g_B(\theta) = 0$. Also, from $\mu > 0$, we can get $n_a N_B - N_A n_b > 0$. In this way, we obtain the first part of the closed form solution (Eq. (15)).

• Case 2: $h(\theta) < 0$ and $\mu = 0$

This is equivalent to the case in which no inequality constraint is applied. From $g_A(\theta) = 0$, we can get $\lambda_A = \sum_{k} n_{i'j'k} = N_A$. Similarly, we can get $\lambda_B = N_B$. Plugging them into Eq. (18), we obtain the second part of the closed form solution (Eq. (16)). From $h(\theta) < 0$, we get $n_a N_B - N_A n_b < 0$.
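
The two cases can be sketched in Python as follows for a single constraint $\theta_{i'j'a} \le \theta_{i''j''b}$ linking two baseline sets. The function name `inter_constrained_ml` and the zero-based indices `a` and `b` are illustrative; the sketch also assumes that, when the constraint is active, each baseline set has at least one other state with a positive count, so that the denominators $N_A - n_a$ and $N_B - n_b$ are non-zero.

```python
def inter_constrained_ml(counts_a, counts_b, a, b):
    """Closed form for theta_A[a] <= theta_B[b] across two baseline sets."""
    n_a, n_b = counts_a[a], counts_b[b]
    big_a, big_b = sum(counts_a), sum(counts_b)
    if n_a * big_b - big_a * n_b > 0:
        # Case 1: the constraint is active, so both parameters take a shared value
        # and the remaining mass is split in proportion to the other counts.
        shared = (n_a + n_b) / (big_a + big_b)
        theta_a = [(1 - shared) * n / (big_a - n_a) for n in counts_a]
        theta_b = [(1 - shared) * n / (big_b - n_b) for n in counts_b]
        theta_a[a] = theta_b[b] = shared
    else:
        # Case 2: the constraint is inactive, so plain ML estimates apply.
        theta_a = [n / big_a for n in counts_a]
        theta_b = [n / big_b for n in counts_b]
    return theta_a, theta_b


active_a, active_b = inter_constrained_ml([6, 4], [2, 8], a=0, b=0)
free_a, free_b = inter_constrained_ml([2, 8], [6, 4], a=0, b=0)
```

In the first call the unconstrained estimates 0.6 and 0.2 violate the constraint, so both constrained parameters move to the pooled value (6 + 2) / (10 + 10) = 0.4.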

5 Evaluation with Synthetic Data


In order to test the performance of our method against ML estimation and the standard EM algorithm given sparse data and incomplete data respectively, we test the algorithms on multiple BNs, each with 20 nodes but with different randomly generated structures and parameters. For each BN structure, 11 BNs with different parameter initializations are generated. One of them is treated as the ground truth, and the 10 others serve as different initializations for parameter learning. For the case of sparse data, 700 samples are generated from the ground truth BN, 200 for testing and the remaining 500 for training. For the case of incomplete data, 400 samples are drawn from the ground truth BN, half for training and half for testing, and all training data associated with hidden nodes are removed. To produce the needed constraints, for the case of sparse data, we randomly choose a subset of all parameters and impose constraints on the selected parameters. For the case of incomplete data, we randomly choose parameters only from those of the hidden nodes, and impose constraints on them. The number of constraints in a CPT is no more than 2. For performance characterization, the Kullback-Leibler (K-L) divergence is used, which measures the distance between the learned parameters and the ground truth.
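
For reference, the K-L divergence between two discrete distributions can be computed as in the sketch below. The direction of the divergence (ground truth first, learned distribution second) and how the per-parameter values are aggregated per node are not specified above, so both are assumptions here; the function name `kl_divergence` is illustrative.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) = sum_k p_k * log(p_k / q_k) for discrete distributions.

    Terms with p_k = 0 contribute nothing; q_k must be positive wherever p_k > 0.
    """
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)


truth = [0.2, 0.8]
learned = [0.3, 0.7]
distance = kl_divergence(truth, learned)
```

The value is zero only when the two distributions coincide and grows as the learned distribution drifts away from the ground truth.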

With complete but sparse data, we compare the learning performance of ML estimation with that of our method with range constraints, intra-relationship constraints and inter-relationship constraints respectively, as shown in Figure 1. We can see that CML is better than ML estimation in both the mean and the standard deviation of the K-L divergence. More specifically, the mean K-L divergence for ML estimation is 0.2087, which decreases to 0.0786 for CML with range constraints, 0.1763 for CML with intra-relationship constraints, and 0.1546 for CML with inter-relationship constraints.


Figure 1: Sparse data learning results compared w.r.t. K-L divergence: ML estimation vs. CML. (a) range constraints; (b) intra-relationship constraints; (c) inter-relationship constraints.


With incomplete data, we compare the learning performance of our method with that of the standard EM method, as shown in Figure 2. The average K-L divergence of hidden nodes decreases from 0.6437 for EM to 0.2361 for CEM with range constraints, 0.3830 for CEM with intra-relationship constraints, and 0.4864 for CEM with inter-relationship constraints. The improvements are especially significant for the hidden nodes (nodes 13 to 20).


Figure 2: Incomplete data learning results compared w.r.t. K-L divergence: EM vs. CEM. (a) range constraints; (b) intra-relationship constraints; (c) inter-relationship constraints.

6 Conclusion


Qualitative domain knowledge generally exists in applications. We define two types of constraints to represent qualitative domain knowledge, and derive closed form solutions for the maximum likelihood parameter estimator under each of the two types of constraints. For the case of sparse data, we directly apply our constrained maximum likelihood estimator, while for incomplete data, we extend the EM method by replacing the M step with our constrained maximum likelihood estimator. The experimental results on synthetic data demonstrate that our method can fully exploit the domain knowledge to improve parameter learning accuracy.